Releases: aws/aws-sdk-pandas
AWS SDK for pandas 3.5.0
Breaking changes 💥
Due to CVEs, Ray is capped to the patched 2.9.x versions. As a result, the latest version of the library cannot be used on the Glue for Ray runtime. We have raised the CVE issue with the Glue team.
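The Ray cap can be verified in an environment before relying on distributed mode. This is a minimal sketch, assuming the installed Ray version is available as a plain `X.Y.Z` string; the helper name `is_within_ray_cap` is hypothetical and not part of the library:

```python
def is_within_ray_cap(version: str, cap: tuple[int, int] = (2, 9)) -> bool:
    """Return True if `version` belongs to the capped minor series (2.9.x by default)."""
    parts = version.split(".")
    # Only major and minor matter for the cap; the patch level is free to vary.
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) == cap
```

The version string could be obtained with the standard library call `importlib.metadata.version("ray")`, which raises `PackageNotFoundError` when Ray is not installed.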
Features/Enhancements 🚀
- Add `spark_properties` to athena spark by @rajagurunath in #2508
- Add `MERGE INTO` support for Iceberg by @LeonLuttenberger in #2527
- Support partitioning by index cols by @kukushking in #2528
- Add `analysis_template_arn` to `cleanrooms.read_sql_query` by @jaidisido in #2584
- Python 3.12 support by @LeonLuttenberger in #2559
  - Note: Ray currently does not support Python 3.12. As such, distributed operations on data frames will not work yet.
  - Relevant Ray issue
- Upgrade to Ray 2.9.0+ and refactor Ray datasources to the new API by @kukushking in #2570
Bug fixes 🐛
- Athena/Neptune minor fixes by @kukushking in #2526
- Reset index and handle last index by @Antropath in #2531
- Oracle failed import message by @matthewdeanmartin in #2537
- Add parameterized queries where possible to address the risk of SQL injection by @LeonLuttenberger in #2540
- SQL identifiers by @kukushking in #2543
- coerce_timestamps - allow None by @kukushking in #2556
- Add validation for `table` and `schema` params for Redshift by @LeonLuttenberger in #2551
- Redshift VARBYTE support by @kukushking in #2573
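The SQL injection fix in #2540 relies on parameterized queries. As a generic illustration of that technique only (it uses Python's built-in `sqlite3` rather than any of the library's database modules, and the table and values are made up), binding a value as a parameter keeps it out of the SQL text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# Hostile input that would change the query if pasted into the SQL string.
user_input = "nobody' OR '1'='1"

# Unsafe: string interpolation lets the input rewrite the WHERE clause.
unsafe_sql = f"SELECT name FROM users WHERE name = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the driver binds the value, so it is compared as a literal string.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # prints "2 0"
```

The interpolated query returns every row because the injected `OR '1'='1'` is always true, while the parameterized query matches nothing.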
Documentation 📚
- Add SSM Public Param usage to docs by @malachi-constant in #2521
Other 🤖
- refactor: Remove usage of boto3 resources by @LeonLuttenberger in #2525
- chore(deps): bump aiohttp from 3.8.5 to 3.8.6 by @dependabot in #2519
- chore(deps): bump aiohttp from 3.8.6 to 3.9.0 by @dependabot in #2535
- chore(deps): bump cryptography from 41.0.4 to 41.0.6 by @dependabot in #2538
- chore(deps-dev): bump jupyter-server from 2.7.2 to 2.11.2 by @dependabot in #2545
- chore: Upgrade test infrastructure dependencies by @LeonLuttenberger in #2562
- chore: Prepare 3.5.0 release by @LeonLuttenberger in #2560
- chore: Upgrade deltalake dependency by @LeonLuttenberger in #2563
- chore: Replace black formatter with ruff format by @LeonLuttenberger in #2568
- chore: ruff improvements by @LeonLuttenberger in #2571
- chore: upgrade `oracledb` to 2.0 by @LeonLuttenberger in #2574
- chore(deps-dev): bump the development-dependencies group with 8 updates by @dependabot in #2577
- chore(deps-dev): bump the development-dependencies group with 5 updates by @dependabot in #2583
- chore(deps-dev): bump the development-dependencies group with 3 updates by @dependabot in #2590
- chore(deps): bump the production-dependencies group with 5 updates by @dependabot in #2591
- chore: type annotations by @LeonLuttenberger in #2585
- chore: Replace PyLint with Ruff by @LeonLuttenberger in #2588
- chore: Update gremlinpython & add aiohttp by @kukushking in #2595
New Contributors
- @rajagurunath made their first contribution in #2508
- @Antropath made their first contribution in #2531
- @matthewdeanmartin made their first contribution in #2537
Full Changelog: 3.4.2...3.5.0
AWS SDK for pandas 3.4.2
Features/Enhancements 🚀
- Update pyarrow to 14.0.1 to fix arbitrary code execution security vulnerability
Full Changelog: 3.4.1...3.4.2
AWS SDK for pandas 3.4.1
Features/Enhancements 🚀
- feat: Add schema evolution to `athena.to_iceberg` by @LeonLuttenberger in #2465
- feat: Athena - add `client_request_token` by @kukushking in #2474
- feat: Redshift data api - allow all auth combinations by @kukushking in #2475
- feat: add columns comments to iceberg by @frenchytheasian in #2482
- feat: Add Python 3.11 layers in `cn-north-1` & `cn-northwest-1` by @kukushking in #2514
Bug fixes 🐛
- fix: Add missing call to `sanitize_column_name` in `create_*_table` by @LeonLuttenberger in #2464
- fix: Hyphenated Iceberg table names by @LeonLuttenberger in #2466
- fix: `requests_aws4auth` not being treated as an optional dependency by @LeonLuttenberger in #2471
- fix: KeyError exception in athena wrangler by @rabingaire in #2483
- fix: column names and apply map by @LumberjackUsingMath in #2492
- fix: Gremlin batch size calc by @kukushking in #2496
Documentation 📚
- docs: Update layers.rst - add cn-north-1 & cn-northwest-1 by @kukushking in #2477
New Contributors
- @rabingaire made their first contribution in #2483
- @frenchytheasian made their first contribution in #2482
- @LumberjackUsingMath made their first contribution in #2492
Full Changelog: 3.4.0...3.4.1
AWS SDK for pandas 3.4.0
Features/Enhancements 🚀
- Geospatial - parse Athena geospatial types via geopandas by @kukushking in #2346
- Allow group identifiers to be used in `wr.cloudwatch` queries by @LeonLuttenberger in #2430
- Add ignore null store parquet metadata by @raaidarshad in #2450
Bug fixes 🐛
- Add missing boto3 session in `athena.to_iceberg` wait_query by @jaidisido in #2428
- Add catalog ID in `athena.to_iceberg` by @jaidisido in #2446
- Return None for missing column and partition key comment by @robert-schmidtke in #2449
- Fix urllib3 error when building AWS Lambda Layers by @LeonLuttenberger in #2447
- Duplicate schema argument in `wr.s3.to_parquet` by @kukushking in #2455
Tests 🧪
- Test dependabot groups feature by @jaidisido in #2426
New Contributors
- @raaidarshad made their first contribution in #2450
Full Changelog: 3.3.0...3.4.0
AWS SDK for pandas 3.3.0
Features/Enhancements 🚀
- Support Athena query prepared statements & Athena parameterized queries by @LeonLuttenberger in #2344
- Add dtype parameter in to_iceberg function by @paulobrunheroto in #2359
- Add CleanRooms read module by @jaidisido in #2366
- Escape and validate table identifiers and literals in PostgreSQL by @kukushking in #2390
- Add Python 3.11 support by @moralesl in #2414
Bug fixes 🐛
- Escape column names in PRIMARY KEY statement in SQL query by @mc51 in #2351
- Remove .lower in dtype sanitize for to_parquet by @jaidisido in #2369
- Enforce use_threads=False when Limit is supplied by @jaidisido in #2372
- Fix Boto3 session not being passed to `cleanrooms.wait_query` by @LeonLuttenberger in #2381
- Allow ANSI-compatible identifiers in RDS Data API by @kukushking in #2391
- Pass schema to chunked parquet reads by @kukushking in #2400
- Support pyarrow schema in DynamoDB read_items #2399 by @jaidisido in #2401
- Upgrade Ray to 2.6 and fix security dependabots by @jaidisido in #2403
- Fix Arrow timezone localization by @kukushking in #2411
- Use from_arrow instead of from_arrow_refs by @jaidisido in #2417
Tests 🧪
- Make minimal tests run on mac and windows by @LeonLuttenberger in #2347
- Add Aurora PostgreSQL Serverless by @kukushking in #2388
New Contributors
- @mc51 made their first contribution in #2351
- @paulobrunheroto made their first contribution in #2359
- @moralesl made their first contribution in #2414
Full Changelog: 3.2.1...3.3.0
AWS SDK for pandas 3.2.1
Fixes 🛠️
- Fix error where library could not be imported on Windows due to `No module named 'pyarrow._orc'` by @LeonLuttenberger in #2341 #2337
- Lower `packaging` version requirement by @LeonLuttenberger in #2340
- Allow Ray 2.5 & downgrade tox by @kukushking in #2338
Full Changelog: 3.2.0...3.2.1
AWS SDK for pandas 3.2.0
Features/Enhancements 🚀
- Add `s3.read_orc` and `s3.to_orc` by @LeonLuttenberger in #2312 🔥
- Apache Spark on Amazon Athena - `wr.athena.create_spark_session` & `wr.athena.run_spark_calculation` by @kukushking in #2314 🚀
- EMR Serverless by @kukushking in #2304 🔥
- Add `to_sql` for RDS Data API by @LeonLuttenberger in #2287
- Add Timestream `UNLOAD` by @kukushking in #2284
- Opensearch parallel bulk by @kukushking in #2310
- Allow user groups to be passed in `allowed_to_use` and `allowed_to_manage` when creating QuickSight resources by @LeonLuttenberger in #2278
- Add engine/memory_format os env variables and delay engine initialization by @jaidisido in #2285
- Support reading with PyArrow-backed types by @LeonLuttenberger in #2292
- Support additional parameters for Neptune bulk load by @LeonLuttenberger in #2297
- Sync ray 2.4 parquet datasource by @kukushking in #2300
- Timestream: Add multi measure write record example by @mandawat in #2317
- Iceberg `PARTITIONED BY` and additional table properties support by @kukushking in #2322
- Add ability to pass schema to `s3.read_parquet` by @kukushking in #2328
Bug fixes 🐛
- Fix recurring issue with `test_spectrum_decimal_cast` by @LeonLuttenberger in #2283
- Fix Redshift unload not escaping SQL query by @LeonLuttenberger in #2286
- Fix KeyError & add lock to athena cache manager by @kukushking in #2299
- Fix Neptune bulk load bad request by @LeonLuttenberger in #2305
- Add AWS_REGION by default to deltalake storage_options by @jaidisido in #2315
Documentation 📚
- Add page for data_api.rds.to_sql by @LeonLuttenberger in #2291
Tests 🧪
- Add unit test for `dtype_backend` use in `read_parquet_table` by @LeonLuttenberger in #2307
- Adapt benchmark tests to Glue for Ray GA breaking changes by @jaidisido in #2316
Refactoring 🛠️
- Refactor SQL formatter by @LeonLuttenberger in #2288
- Refactor engine `register_func` to handle type checking by @LeonLuttenberger in #2309
New Contributors
Full Changelog: 3.1.1...3.2.0
AWS SDK for pandas 3.1.1
What's Changed
- fix: Add missing `packaging` dependency by @LeonLuttenberger in #2281
Full Changelog: 3.1.0...3.1.1
AWS SDK for pandas 3.1.0
Features/Enhancements 🚀
- Add `neptune.bulk_load` for bulk loading data into Neptune by @LeonLuttenberger in #2238 #2267
- Add `s3.to_deltalake` function by @LeonLuttenberger in #2228
- Add Timestream Batch Load support by @jaidisido in #2214
- Add Iceberg insert by @kukushking in #2233
- Support upsert mode for OracleDB by @LeonLuttenberger in #2265
- Add `chunked` parameter to DynamoDB read functions by @LeonLuttenberger in #2227
- Upgrade Modin to 0.20.1 & allow Ray 2.4 by @kukushking in #2234
- Support Glue Connection SSM credential type by @kukushking in #2232
- Add ability to pass schema to S3 Select by @kukushking in #2237
- Add dynamic classification EMR config by @LLejoly in #2250
- Add support for server-side cursors in PostgreSQL module by @kukushking in #2262
- Add time unit to Timestream write API by @jaidisido in #2263
Fixes 🛠️
- Set `ignore_metadata` to `False` by default by @jaidisido in #2206
- Fix conflicting types for `path_ignore_suffix` by @LeonLuttenberger in #2240
- Athena workgroup query engine v3 upgrade artifacts by @kukushking in #2243
- Fixing `test_spectrum_decimal_cast` test by @LeonLuttenberger in #2244
- `emr.create_cluster` was not passing security configuration to internal method by @malachi-constant in #2246
- Fix pagination in `timestream.list_tables` by @SukruHan #2275
Documentation 📚
- Include our ADRs in GitHub by @LeonLuttenberger in #2215 #2259
- Fixes in the Athena Cache tutorial by @patrick-muller in #2201
- Write ADR for the switching between PyArrow and Pandas I/O functions by @LeonLuttenberger in #2245
- Fix "about" URL in README by @CGarces in #2207
- Update `layers.rst` with Python 3.10 layers by @LeonLuttenberger in #2219
- Fix links to 'Who uses library' section by @LeonLuttenberger in #2241
- Declutter function overloads by extracting overloads to `pyi` files by @LeonLuttenberger in #2229 #2255 #2256
Full Changelog: 3.0.0...3.1.0
AWS SDK for pandas 3.0.0
Breaking changes 💥
- Move dependencies to optional by @jaidisido in #1992 🔓
  - Dependencies required by the following modules have been moved to optional: redshift, mysql, postgres, sqlserver, oracle, gremlin, sparql, deltalake
  - The required dependencies can be easily installed with `pip install awswrangler[<MODULE_NAME>]`, for example `pip install awswrangler[redshift]`
- Change SQL formatters for Athena and LakeFormation so that they properly format types by @Taragolis and @LeonLuttenberger in #1416 #1543 #1684 💾
  - For example a parameter of type `dt.datetime` is parsed into `DATETIME xxxx-xx-xx xx:xx:xx`, while a parameter of type `str` is formatted into `"x"`
- Refactor function signatures so that closely related parameters are grouped into a single parameter defined as a `TypedDict` by @LeonLuttenberger and @kukushking in #1855 #1996 #2016 #2055 #2081 💼
  - Glue catalog parameters are grouped together in `to_parquet`, `to_csv` and `to_json`
  - Athena UNLOAD and CTAS parameters are grouped together
- Deprecate `wr.s3.merge_upsert_table` by @kukushking in #2076 ⚠️
- Deprecate `updated_name` parameter in `update_ruleset` by @jaidisido in #2122 ⚠️
- Stop support for Python 3.7 ⚠️
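Because these dependencies are now optional, a script may want to check what is installed before calling a module that needs them. This is a minimal sketch using only the standard library; the helper `missing_modules` is hypothetical and not part of the library, and the import names passed to it must be confirmed for each extra:

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be imported in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]
```

An empty result means every listed module can be imported; otherwise the missing ones can be installed through the matching extra, as in `pip install awswrangler[redshift]`.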
New functionalities 🚀
AWS SDK for pandas can now run at scale 🚀💻🚀
Tutorials
- 034 - Distributing Calls Using Ray
- 035 - Distributing Calls on Ray Remote Cluster
- 036 - Distributing Calls with Glue Interactive Sessions on Ray
AWS Blogs
Features/Enhancements 🚀
- Thread-safety improvements by @kukushking in #2186
- Allow Python 3.11 by @kukushking in #2101 🐍
- Add `use_threads` parameter to `dynamodb.read_items` by @LeonLuttenberger in #2113 📈
- Distribute `wr.dynamodb.put_df` with executor task by @LeonLuttenberger in #2118 📈
- Add additional arg for glue database `DatabaseInput` by @malachi-constant in #2067 🔧
- Add overloads for functions which can have multiple return value types by @LeonLuttenberger #1855
- Add support for boto3 kwargs to `timestream.create_table` by @cnfait in #1819
- Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
- Upgrade to Ray 2.0 by @kukushking in #1635
- Add partitioning on block level by @kukushking in #1653
- Use fast file metadata provider by @kukushking in #1997
- Distribute DynamoDB Parallel Scan by @jaidisido in #1981
- Add faster Pyarrow S3fs listing in distributed mode by @jaidisido in #2030
- Add distributed variant of the `_read_parquet_metadata_file` function based on the PyArrow file system by @LeonLuttenberger in #2050
- Validate distributed kwargs by @kukushking in #2051
- Add `@Experimental` and `@Deprecated` annotations by @kukushking in #2062
- Distribute S3 `describe_objects` by @jaidisido in #2069
- Distributed S3 copy/merge by @kukushking in #2070
- Add `bulk_read` option for reading large amounts of Parquet files quickly by @LeonLuttenberger in #2033
- Deprecate boto3 resources by @kukushking in #2097
- Add retries for s3 select by @kukushking in #1780
- Make tqdm progress reporting opt-in by @kukushking in #1741
- Distribute data types inference by @jaidisido in #1692
- Change to singledispatch, add repartitioning utility, fix distributed write text regression by @kukushking in #1611
- Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
- Configure scheduling options, remove dependencies on internal ray impl by @kukushking in #1734
- Validate partitions along row axis, add warning by @kukushking in #1700
- Refactor executor module by @kukushking in #2120
- Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
- Distribute Timestream write with executor by @jaidisido in #1715
- Distribute `s3.to_json` and `s3.to_csv` by @LeonLuttenberger in #1631
- Distribute `s3.read_csv`, `s3.read_json` and `s3.read_fwf` by @LeonLuttenberger in #1567 #1607
- Distribute `s3.wait_objects` by @LeonLuttenberger in #1539
- Distribute `s3.to_parquet` by @kukushking in #1526
- Distribute `s3.delete_objects` by @malachi-constant in #1474
- Distribute `s3.read_parquet` by @jaidisido in #1513
- Add ThreadPoolExecutor and RayExecutor; refactor threading/ray; add single-path distributed `s3.select_query` by @kukushking in #1446
- Add distributed Lake Formation read by @jaidisido in #1397
- Refactor ray datasources by @kukushking in #1687
- Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
- Add `Literal` typing for `mode` and `projection_types` by @LeonLuttenberger in #2191
Fixes 🛠️
- Sanitize bucketing col names by @kukushking in #2155
- Allow writing files from an empty dataframe by @malachi-constant in #2045
- Athena out of bound dates by @kukushking in #2180
- Fix partition block overwriting by @kukushking in #1695
- Distrib S3 Select - check row count before creating the Ray dataset by @kukushking in #1808
- Allow to pass pandas dfs to Ray/Modin calls by @kukushking in #1812
- Add retries to `read_parquet_metadata_distributed` by @jaidisido in #2196
- Fix default `utcnow` argument in `start_query` by @LeonLuttenberger in #2193
Documentation 📚
- Athena Iceberg tutorial by @kukushking in #2117
- Add at scale section by @kukushking in #2119
- Documentation spell-checking improvements by @LeonLuttenberger in #2165
- Add AWS Glue on Ray docs by @jaidisido in #1810
- Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
- Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
- Add "Introduction to Ray" Tutorials by @LeonLuttenberger in #1661
- Add SDK for pandas job on ray cluster tutorial by @malachi-constant in #1616
- Add typeddicts to docs by @LeonLuttenberger in #2167
Tests 🧪
- Add PR linter Github action by @jaidisido in #2106
- Replace load tests bucket with SSM parameter by @jaidisido in #2121
- opensearch index cleanup / skip by @kukushking in #2149
- Add benchmark tests by @jaidisido in #2143
- Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
- Remove `awswrangler.distributed` from coverage report by @LeonLuttenberger in #1884
- Consolidate unit and load tests by @jaidisido in #1525
- Distribute tests in tox config by @malachi-constant in #1469
Full Changelog: 2.20.1...3.0.0