awswrangler.athena.to_iceberg does not support concurrent/parallel Lambda instances #2651
Comments
Hi @B161851, if you are inserting concurrently, you need to make sure |
@kukushking hi, I'm facing the same issue, even with only two concurrent writers (Lambdas). The table exists. I'm trying to perform an upsert (MERGE INTO) operation. In my case the upserts even target different partitions (different parts of the table), so I don't think it's a race condition.
ICEBERG_COMMIT_ERROR: Failed to commit Iceberg update to table |
Just wanted to bump this issue up as well. Particular use case is uses Have had to lock lambda concurrency to 1 to avoid the |
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed. |
bump |
bump. addressing this feature will be very helpful |
All, it looks like this is a service-side issue. Please raise a support request. @ChanTheDataExplorer @Salatich @vibe is it also |
On my side it is just |
We're seeing a lot of |
When working with parallel writing to an Iceberg table using awswrangler in AWS Lambda, there are some specific considerations and configurations to handle to avoid issues like duplicates,
Best approach:-
Hope this helps. Let me know if there are similar issues. |
@Siddharth-Latthe-07 I want to check with you: your recommendations do not solve the Also @peterklingelhofer / @Salatich, did you have any luck with this issue? |
@Acehaidrey Here are some additional strategies you can look into:
Combining unique temporary paths with proper locking mechanisms, using AWS Glue for coordination, leveraging Iceberg's conflict resolution, and implementing retries can help overcome the ICEBERG_COMMIT_ERROR and other issues during parallel writes. |
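For the retry part of that summary, here is a minimal sketch of exponential backoff with jitter around a write call. It assumes the failure surfaces as an exception whose message contains `ICEBERG_COMMIT_ERROR` (as in the error text quoted in this thread); `retry_on_commit_error` and its parameters are hypothetical names, not part of awswrangler. As reported elsewhere in this thread, retries reduce failures but do not remove the underlying commit conflicts.

```python
import random
import time


def retry_on_commit_error(write_fn, max_attempts=5, base_delay=1.0,
                          max_delay=30.0, sleep=time.sleep):
    """Call write_fn(), retrying only when the error message contains
    ICEBERG_COMMIT_ERROR. Any other error is re-raised immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception as exc:
            # Assumption: the commit conflict is identifiable by message text.
            is_commit_error = "ICEBERG_COMMIT_ERROR" in str(exc)
            if not is_commit_error or attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # concurrent Lambdas do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Hypothetical usage (needs awswrangler, AWS credentials, and your own names):
# import awswrangler as wr
# retry_on_commit_error(
#     lambda: wr.athena.to_iceberg(
#         df=df, database="my_db", table="my_table",
#         temp_path="s3://my-bucket/tmp/some-unique-prefix/",
#     )
# )
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.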
Hi @Acehaidrey, unfortunately no luck. In my approach, I use unique names for temp_path and update different partitions (so I believe there is no race). Also, the table exists, so there is no race condition on creating a table. I'm using exponential backoff, which kind of helps, but I constantly see retry warnings in my Lambdas with ICEBERG_COMMIT_ERROR. Also, this statement confuses me (from https://repost.aws/knowledge-center/athena-iceberg-table-error): |
Thank you @Siddharth-Latthe-07. I think this will indeed slow down the program. It seems the answer is to limit the parallelism, which isn't the solution we want to move towards :/ |
Describe the bug
For parallel writing, if keep_files=True, the result contains duplicates. I tried appending a nanosecond timestamp to the temporary path so that it is unique, but now I get "ICEBERG_COMMIT_ERROR".
If keep_files=False, it gives "HIVE_CANNOT_OPEN_SPLIT NoSuchKey Error" when ingesting Iceberg data in parallel,
and we observed that with keep_files=False the library removes the entire temp_path from S3, which leads to the above error.
It does not support writing to the Iceberg table using wrangler from Lambda.
So, how can we overcome the above issues when writing to an Iceberg table in parallel from Lambda using awswrangler?
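For reference, a minimal sketch of the unique temporary path idea mentioned above. `unique_temp_path` is a hypothetical helper, not part of awswrangler, and the bucket name in the usage comment is a placeholder. It only keeps parallel invocations from sharing (and deleting) each other's temp files; it does not address the ICEBERG_COMMIT_ERROR itself.

```python
import time
import uuid


def unique_temp_path(base_path, request_id=None):
    """Return a per-invocation S3 prefix under base_path.

    Combines a nanosecond timestamp with a random token (or a caller-supplied
    id, e.g. the Lambda request id) so two writers never share a prefix.
    """
    token = request_id or uuid.uuid4().hex
    return f"{base_path.rstrip('/')}/{time.time_ns()}-{token}/"


# Hypothetical usage inside a Lambda handler:
# temp_path = unique_temp_path("s3://my-bucket/iceberg-tmp", context.aws_request_id)
```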
How to Reproduce
We observed that with keep_files=False the library removes the entire temp_path from S3, which results in "HIVE_CANNOT_OPEN_SPLIT NoSuchKey Error".
If the library removed only the particular Parquet file from temp_path instead of removing the entire temp_path from S3, I think it might avoid the above error.
Expected behavior
No response
Your project
No response
Screenshots
No response
OS
Win
Python version
3.8
AWS SDK for pandas version
12
Additional context
No response