Releases · aws/aws-sdk-pandas

11 Jan 12:32

3.5.0

37bd54d

AWS SDK for pandas 3.5.0

Breaking changes 💥

Due to CVEs, Ray is capped to the patched 2.9.x versions. As a result, the latest version of the library cannot be used on the Glue for Ray runtime. We have raised the CVE issue with the Glue team.

Features/Enhancements 🚀

Add spark_properties to athena spark by @rajagurunath in #2508
Add MERGE INTO support for Iceberg by @LeonLuttenberger in #2527
Support partitioning by index cols by @kukushking in #2528
Add analysis_template_arn to cleanrooms.read_sql_query by @jaidisido in #2584
Python 3.12 support by @LeonLuttenberger in #2559
- Note: Ray currently does not support Python 3.12. As such, distributed operations on data frames will not work yet.
- Relevant Ray issue
Upgrade to Ray 2.9.0+ and refactor Ray datasources to the new API by @kukushking in #2570

Bug fixes 🐛

Athena/Neptune minor fixes by @kukushking in #2526
Reset index and handle last index by @Antropath in #2531
Oracle failed import message by @matthewdeanmartin in #2537
Add parameterized queries where possible to address the risk of SQL injection by @LeonLuttenberger in #2540
SQL identifiers by @kukushking in #2543
coerce_timestamps - allow None by @kukushking in #2556
Add validation for table and schema params for Redshift by @LeonLuttenberger in #2551
Redshift VARBYTE support by @kukushking in #2573
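
The parameterized-query fix above follows a general pattern that can be sketched outside the library. The example below is a minimal illustration using Python's built-in sqlite3 module, not the library's own code; the `users` table and the `find_users` helper are hypothetical names chosen for the example.

```python
import sqlite3


def find_users(conn: sqlite3.Connection, name: str) -> list:
    # The "?" placeholder makes the driver bind `name` as a value,
    # so it is never interpreted as part of the SQL text.
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


# Hypothetical in-memory table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# A classic injection payload is treated as a plain string and matches nothing.
rows_injection = find_users(conn, "alice' OR '1'='1")
rows_regular = find_users(conn, "alice")
```

Had the query been built with string concatenation, the same payload would have changed the WHERE clause and matched every row.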

Documentation 📚

Add SSM Public Param usage to docs by @malachi-constant in #2521

Other 🤖

refactor: Remove usage of boto3 resources by @LeonLuttenberger in #2525
chore(deps): bump aiohttp from 3.8.5 to 3.8.6 by @dependabot in #2519
chore(deps): bump aiohttp from 3.8.6 to 3.9.0 by @dependabot in #2535
chore(deps): bump cryptography from 41.0.4 to 41.0.6 by @dependabot in #2538
chore(deps-dev): bump jupyter-server from 2.7.2 to 2.11.2 by @dependabot in #2545
chore: Upgrade test infrastructure dependencies by @LeonLuttenberger in #2562
chore: Prepare 3.5.0 release by @LeonLuttenberger in #2560
chore: Upgrade deltalake dependency by @LeonLuttenberger in #2563
chore: Replace black formatter with ruff format by @LeonLuttenberger in #2568
chore: ruff improvements by @LeonLuttenberger in #2571
chore: upgrade oracledb to 2.0 by @LeonLuttenberger in #2574
chore(deps-dev): bump the development-dependencies group with 8 updates by @dependabot in #2577
chore(deps-dev): bump the development-dependencies group with 5 updates by @dependabot in #2583
chore(deps-dev): bump the development-dependencies group with 3 updates by @dependabot in #2590
chore(deps): bump the production-dependencies group with 5 updates by @dependabot in #2591
chore: type annotations by @LeonLuttenberger in #2585
chore: Replace PyLint with Ruff by @LeonLuttenberger in #2588
chore: Update gremlinpython & add aiohttp by @kukushking in #2595

New Contributors

@rajagurunath made their first contribution in #2508
@Antropath made their first contribution in #2531
@matthewdeanmartin made their first contribution in #2537

Full Changelog: 3.4.2...3.5.0

Contributors

matthewdeanmartin, kukushking, and 6 other contributors


13 Nov 19:02

kukushking

3.4.2

24e6e81

AWS SDK for pandas 3.4.2

Features/Enhancements 🚀

Update pyarrow to 14.0.1 to fix arbitrary code execution security vulnerability

Full Changelog: 3.4.1...3.4.2


24 Oct 13:19

kukushking

3.4.1

a61559a

AWS SDK for pandas 3.4.1

Features/Enhancements 🚀

feat: Add schema evolution to athena.to_iceberg by @LeonLuttenberger in #2465
feat: Athena - add client_request_token by @kukushking in #2474
feat: Redshift data api - allow all auth combinations by @kukushking in #2475
feat: add columns comments to iceberg by @frenchytheasian in #2482
feat: Add Python 3.11 layers in cn-north-1 & cn-northwest-1 by @kukushking in #2514

Bug fixes 🐛

fix: Add missing call to sanitize_column_name in create_*_table by @LeonLuttenberger in #2464
fix: Hyphenated Iceberg table names by @LeonLuttenberger in #2466
fix: requests_aws4auth not being treated as an optional dependency by @LeonLuttenberger in #2471
fix: KeyError exception in athena wrangler by @rabingaire in #2483
fix: column names and apply map by @LumberjackUsingMath in #2492
fix: Gremlin batch size calc by @kukushking in #2496

Documentation 📚

docs: Update layers.rst - add cn-north-1 & cn-northwest-1 by @kukushking in #2477

New Contributors

@rabingaire made their first contribution in #2483
@frenchytheasian made their first contribution in #2482
@LumberjackUsingMath made their first contribution in #2492

Full Changelog: 3.4.0...3.4.1

Contributors

kukushking, LeonLuttenberger, and 3 other contributors


11 Sep 18:35

LeonLuttenberger

3.4.0

1842da8

AWS SDK for pandas 3.4.0

Features/Enhancements 🚀

Geospatial - parse Athena geospatial types via geopandas by @kukushking in #2346
Allow group identifiers to be used in wr.cloudwatch queries by @LeonLuttenberger in #2430
Add ignore null store parquet metadata by @raaidarshad in #2450

Bug fixes 🐛

Add missing boto3 session in athena.to_iceberg wait_query by @jaidisido in #2428
Add catalog ID in athena.to_iceberg by @jaidisido in #2446
Return None for missing column and partition key comment by @robert-schmidtke in #2449
Fix urllib3 error when building AWS Lambda Layers by @LeonLuttenberger in #2447
Duplicate schema argument in wr.s3.to_parquet by @kukushking in #2455

Tests 🧪

Test dependabot groups feature by @jaidisido in #2426

New Contributors

@raaidarshad made their first contribution in #2450

Full Changelog: 3.3.0...3.4.0

Contributors

robert-schmidtke, kukushking, and 3 other contributors


01 Aug 19:53

jaidisido

3.3.0

1e2f940

AWS SDK for pandas 3.3.0

Features/Enhancements 🚀

Support Athena query prepared statements & Athena parameterized queries by @LeonLuttenberger in #2344
Add dtype parameter in to_iceberg function by @paulobrunheroto in #2359
Add CleanRooms read module by @jaidisido in #2366
Escape and validate table identifiers and literals in PostgreSQL by @kukushking in #2390
Add Python 3.11 support by @moralesl in #2414

Bug fixes 🐛

Escape column names in PRIMARY KEY statement in SQL query by @mc51 in #2351
Remove .lower in dtype sanitize for to_parquet by @jaidisido in #2369
Enforce use_threads=False when Limit is supplied by @jaidisido in #2372
Fix Boto3 session not being passed to cleanrooms.wait_query by @LeonLuttenberger in #2381
Allow ANSI-compatible identifiers in RDS Data API by @kukushking in #2391
Pass schema to chunked parquet reads by @kukushking in #2400
Support pyarrow schema in DynamoDB read_items #2399 by @jaidisido in #2401
Upgrade Ray to 2.6 and fix security dependabots by @jaidisido in #2403
Fix Arrow timezone localization by @kukushking in #2411
Use from_arrow instead of from_arrow_refs by @jaidisido in #2417

Tests 🧪

Make minimal tests run on mac and windows by @LeonLuttenberger in #2347
Add Aurora PostgreSQL Serverless by @kukushking in #2388

New Contributors

@mc51 made their first contribution in #2351
@paulobrunheroto made their first contribution in #2359
@moralesl made their first contribution in #2414

Full Changelog: 3.2.1...3.3.0

Contributors

kukushking, LeonLuttenberger, and 4 other contributors


14 Jun 21:59

LeonLuttenberger

3.2.1

3dd4fa9

AWS SDK for pandas 3.2.1

Fixes 🛠️

Fix error where library could not be imported on Windows due to No module named 'pyarrow._orc' by @LeonLuttenberger in #2341 #2337
Lower packaging version requirement by @LeonLuttenberger in #2340
Allow Ray 2.5 & downgrade tox by @kukushking in #2338

Full Changelog: 3.2.0...3.2.1

Contributors

kukushking and LeonLuttenberger


13 Jun 00:07

kukushking

3.2.0

44891b8

AWS SDK for pandas 3.2.0

Features/Enhancements 🚀

Add s3.read_orc and s3.to_orc by @LeonLuttenberger in #2312 🔥
Apache Spark on Amazon Athena - wr.athena.create_spark_session & wr.athena.run_spark_calculation by @kukushking in #2314 🚀
EMR Serverless by @kukushking in #2304 🔥
Add to_sql for RDS Data API by @LeonLuttenberger in #2287
Add Timestream UNLOAD by @kukushking in #2284
OpenSearch parallel bulk by @kukushking in #2310
Allow user groups to be passed in allowed_to_use and allowed_to_manage when creating QuickSight resources by @LeonLuttenberger in #2278
Add engine/memory_format os env variables and delay engine initialization by @jaidisido in #2285
Support reading with PyArrow-backed types by @LeonLuttenberger in #2292
Support additional parameters for Neptune bulk load by @LeonLuttenberger in #2297
Sync ray 2.4 parquet datasource by @kukushking in #2300
Timestream: Add multi measure write record example by @mandawat in #2317
Iceberg PARTITIONED BY and additional table properties support by @kukushking in #2322
Add ability to pass schema to s3.read_parquet by @kukushking in #2328

Bug fixes 🐛

Fix recurring issue with test_spectrum_decimal_cast by @LeonLuttenberger in #2283
Fix Redshift unload not escaping SQL query by @LeonLuttenberger in #2286
Fix KeyError & add lock to athena cache manager by @kukushking in #2299
Fix Neptune bulk load bad request by @LeonLuttenberger in #2305
Add AWS_REGION by default to deltalake storage_options by @jaidisido in #2315

Documentation 📚

Add page for data_api.rds.to_sql by @LeonLuttenberger in #2291

Tests 🧪

Add unit test for dtype_backend use in read_parquet_table by @LeonLuttenberger in #2307
Adapt benchmark tests to Glue for Ray GA breaking changes by @jaidisido in #2316

Refactoring 🛠️

Refactor SQL formatter by @LeonLuttenberger in #2288
Refactor engine register_func to handle type checking by @LeonLuttenberger in #2309

New Contributors

@mandawat made their first contribution in #2317

Full Changelog: 3.1.1...3.2.0

Contributors

kukushking, LeonLuttenberger, and 2 other contributors


16 May 00:12

LeonLuttenberger

3.1.1

965d2d0

AWS SDK for pandas 3.1.1

What's Changed

fix: Add missing packaging dependency by @LeonLuttenberger in #2281

Full Changelog: 3.1.0...3.1.1

Contributors

LeonLuttenberger


15 May 17:54

LeonLuttenberger

3.1.0

3696e32

AWS SDK for pandas 3.1.0

Features/Enhancements 🚀

Add neptune.bulk_load for bulk loading data into Neptune by @LeonLuttenberger in #2238 #2267
Add s3.to_deltalake function by @LeonLuttenberger in #2228
Add Timestream Batch Load support by @jaidisido in #2214
Add Iceberg insert by @kukushking in #2233
Support upsert mode for OracleDB by @LeonLuttenberger in #2265
Add chunked parameter to DynamoDB read functions by @LeonLuttenberger in #2227
Upgrade Modin to 0.20.1 & allow Ray 2.4 by @kukushking in #2234
Support Glue Connection SSM credential type by @kukushking in #2232
Add ability to pass schema to S3 Select by @kukushking in #2237
Add dynamic classification EMR config by @LLejoly in #2250
Add support for server-side cursors in PostgreSQL module by @kukushking in #2262
Add time unit to Timestream write API by @jaidisido in #2263

Fixes 🛠️

Set ignore_metadata to False by default by @jaidisido in #2206
Fix conflicting types for path_ignore_suffix by @LeonLuttenberger in #2240
Athena workgroup query engine v3 upgrade artifacts by @kukushking in #2243
Fixing test_spectrum_decimal_cast test by @LeonLuttenberger in #2244
emr.create_cluster was not passing security configuration to internal method by @malachi-constant in #2246
Fix pagination in timestream.list_tables by @SukruHan in #2275

Documentation 📚

Include our ADRs in GitHub by @LeonLuttenberger in #2215 #2259
Fixes in the Athena Cache tutorial by @patrick-muller in #2201
Write ADR for the switching between PyArrow and Pandas I/O functions by @LeonLuttenberger in #2245
Fix "about" URL in README by @CGarces in #2207
Update layers.rst with Python 3.10 layers by @LeonLuttenberger in #2219
Fix links to 'Who uses library' section by @LeonLuttenberger in #2241
Declutter function overloads by extracting overloads to pyi files by @LeonLuttenberger in #2229 #2255 #2256

Full Changelog: 3.0.0...3.1.0

Contributors

kukushking, malachi-constant, and 6 other contributors


13 Apr 16:54

jaidisido

3.0.0

5c560ee

AWS SDK for pandas 3.0.0

Breaking changes 💥

Move dependencies to optional by @jaidisido in #1992 🔓
- Dependencies required by the following modules have been moved to optional: redshift, mysql, postgres, sqlserver, oracle, gremlin, sparql, deltalake
- The required dependencies can be easily installed with pip install awswrangler[<MODULE_NAME>], for example pip install awswrangler[redshift]
Change SQL formatters for Athena and LakeFormation so that they properly format types by @Taragolis and @LeonLuttenberger in #1416 #1543 #1684 💾
- For example, a parameter of type dt.datetime is parsed into DATETIME xxxx-xx-xx xx:xx:xx, while a parameter of type str is formatted into "x"
Refactor function signatures so that closely related parameters are grouped into a single parameter defined as a TypedDict by @LeonLuttenberger and @kukushking in #1855 #1996 #2016 #2055 #2081 💼
- Glue catalog parameters are grouped together in to_parquet, to_csv and to_json
- Athena UNLOAD and CTAS parameters are grouped together
Deprecate wr.s3.merge_upsert_table by @kukushking in #2076 ⚠️
Deprecate updated_name parameter in update_ruleset by @jaidisido in #2122 ⚠️
Stop support for Python 3.7 ⚠️
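
The signature refactoring listed above groups closely related keyword arguments into one typed dictionary. The sketch below illustrates that pattern with the standard library only; `TableSettings` and `write_dataset` are hypothetical names invented for this example and are not the library's actual types or functions.

```python
from typing import Dict, Optional, TypedDict


class TableSettings(TypedDict, total=False):
    """Hypothetical group of catalog-related options (all keys optional)."""

    table_type: str
    description: str
    parameters: Dict[str, str]


def write_dataset(path: str, table_settings: Optional[TableSettings] = None) -> Dict[str, object]:
    # One grouped argument replaces several loosely related keyword arguments.
    settings: TableSettings = table_settings or {}
    return {
        "path": path,
        "table_type": settings.get("table_type", "EXTERNAL_TABLE"),
        "description": settings.get("description", ""),
        "parameters": settings.get("parameters", {}),
    }


# Callers pass a plain dict; type checkers validate its keys and value types.
result = write_dataset("s3://bucket/prefix/", table_settings={"description": "daily sales"})
```

Because a TypedDict is an ordinary dict at runtime, existing callers only need to move the affected keyword arguments into a single dictionary.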

New functionalities 🚀

AWS SDK for pandas can now run at scale 🚀💻🚀

Tutorials

AWS Blogs

Scale AWS SDK for pandas workloads with AWS Glue for Ray

Features/Enhancements 🚀

Thread-safety improvements by @kukushking in #2186
Allow Python 3.11 by @kukushking in #2101 🐍
Add use_threads parameter to dynamodb.read_items by @LeonLuttenberger in #2113 📈
Distribute wr.dynamodb.put_df with executor task by @LeonLuttenberger in #2118 📈
Add additional arg for glue database DatabaseInput by @malachi-constant in #2067 🔧
Add overloads for functions which can have multiple return value types by @LeonLuttenberger in #1855
Add support for boto3 kwargs to timestream.create_table by @cnfait in #1819
Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
Upgrade to Ray 2.0 by @kukushking in #1635
Add partitioning on block level by @kukushking in #1653
Use fast file metadata provider by @kukushking in #1997
Distribute DynamoDB Parallel Scan by @jaidisido in #1981
Add faster Pyarrow S3fs listing in distributed mode by @jaidisido in #2030
Add distributed variant of the _read_parquet_metadata_file function based on the PyArrow file system by @LeonLuttenberger in #2050
Validate distributed kwargs by @kukushking in #2051
Add @Experimental and @Deprecated annotations by @kukushking in #2062
Distribute S3 describe_objects by @jaidisido in #2069
Distributed S3 copy/merge by @kukushking in #2070
Add bulk_read option for reading large amounts of Parquet files quickly by @LeonLuttenberger in #2033
Deprecate boto3 resources by @kukushking in #2097
Add retries for s3 select by @kukushking in #1780
Make tqdm progress reporting opt-in by @kukushking in #1741
Distribute data types inference by @jaidisido in #1692
Change to singledispatch, add repartitioning utility, fix distributed write text regression by @kukushking in #1611
Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
Configure scheduling options, remove dependencies on internal ray impl by @kukushking in #1734
Validate partitions along row axis, add warning by @kukushking in #1700
Refactor executor module by @kukushking in #2120
Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
Distribute Timestream write with executor by @jaidisido in #1715
Distribute s3.to_json and s3.to_csv by @LeonLuttenberger in #1631
Distribute s3.read_csv, s3.read_json and s3.read_fwf by @LeonLuttenberger in #1567 #1607
Distribute s3.wait_objects by @LeonLuttenberger in #1539
Distribute s3.to_parquet by @kukushking in #1526
Distribute s3.delete_objects by @malachi-constant in #1474
Distribute s3.read_parquet by @jaidisido in #1513
Add ThreadPoolExecutor and RayExecutor; refactor threading/ray; add single-path distributed s3.select_query by @kukushking in #1446
Add distributed Lake Formation read by @jaidisido in #1397
Refactor ray datasources by @kukushking in #1687
Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
Add Literal typing for mode and projection_types by @LeonLuttenberger in #2191

Fixes 🛠️

Sanitize bucketing col names by @kukushking in #2155
Allow writing files from an empty dataframe by @malachi-constant in #2045
Athena out of bound dates by @kukushking in #2180
Fix partition block overwriting by @kukushking in #1695
Distributed S3 Select - check row count before creating the Ray dataset by @kukushking in #1808
Allow passing pandas DataFrames to Ray/Modin calls by @kukushking in #1812
Add retries to read_parquet_metadata_distributed by @jaidisido in #2196
Fix default utcnow argument in start_query by @LeonLuttenberger in #2193

Documentation 📚

Athena Iceberg tutorial by @kukushking in #2117
Add at scale section by @kukushking in #2119
Documentation spell-checking improvements by @LeonLuttenberger in #2165
Add AWS Glue on Ray docs by @jaidisido in #1810
Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
Add "Introduction to Ray" Tutorials by @LeonLuttenberger in #1661
Add SDK for pandas job on ray cluster tutorial by @malachi-constant in #1616
Add typeddicts to docs by @LeonLuttenberger in #2167

Tests 🧪

Add PR linter GitHub action by @jaidisido in #2106
Replace load tests bucket with SSM parameter by @jaidisido in #2121
OpenSearch index cleanup / skip by @kukushking in #2149
Add benchmark tests by @jaidisido in #2143
Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
Remove awswrangler.distributed from coverage report by @LeonLuttenberger in #1884
Consolidate unit and load tests by @jaidisido in #1525
Distribute tests in tox config by @malachi-constant in #1469

Full Changelog: 2.20.1...3.0.0

Contributors

kukushking, Taragolis, and 4 other contributors

