BUG: geojson is not read correctly with geopandas>=1.0.0 #445

Open
2 of 3 tasks
veenstrajelmer opened this issue Jul 7, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@veenstrajelmer

veenstrajelmer commented Jul 7, 2024

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of geopandas.

  • (optional) I have confirmed this bug exists on the main branch of geopandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import geopandas as gpd
uhslc_gpd = gpd.read_file("https://uhslc.soest.hawaii.edu/data/meta.geojson")
time_min = uhslc_gpd["fd_span"].apply(lambda x: x["oldest"])

Problem description

The above code raises "TypeError: string indices must be integers, not 'str'" in geopandas>=1.0.0. For older versions the code runs successfully. The issue is that the column now contains strings holding JSON-encoded dicts instead of plain dicts. It seems that something goes wrong with the parsing of the geojson.
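
The failure mode can be reproduced without the remote file. The following is a minimal sketch, assuming the column value arrives as a JSON-encoded string; the sample value (including the "latest" key and both dates) is made up for illustration and is not taken from the actual data.

```python
import json

# Hypothetical cell value: a JSON-encoded string rather than a dict
cell = '{"oldest": "1905-01-01", "latest": "2024-06-30"}'

try:
    cell["oldest"]  # indexing a str with a str key
    raised = False
except TypeError:
    raised = True

# Decoding the string first restores dict-style access
oldest = json.loads(cell)["oldest"]
```

Indexing the raw string raises the TypeError, while decoding it with json.loads gives back a dict that can be indexed by key.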

Expected Output

A subset of the original column.

Output of geopandas.show_versions()

SYSTEM INFO

python : 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:29:11) [MSC v.1935 64 bit (AMD64)]
executable : C:\Users\veenstra\Anaconda3\envs\dfm_tools_env\python.exe
machine : Windows-10-10.0.19045-SP0

GEOS, GDAL, PROJ INFO

GEOS : 3.11.2
GEOS lib : None
GDAL : 3.8.5
GDAL data dir: C:\Users\veenstra\Anaconda3\envs\dfm_tools_env\Lib\site-packages\pyogrio\gdal_data
PROJ : 9.3.0
PROJ data dir: C:\Users\veenstra\Anaconda3\envs\dfm_tools_env\Lib\site-packages\pyproj\proj_dir\share\proj

PYTHON DEPENDENCIES

geopandas : 1.0.0
numpy : 1.26.4
pandas : 2.2.2
pyproj : 3.6.1
shapely : 2.0.2
pyogrio : 0.9.0
geoalchemy2: None
geopy : 2.4.1
matplotlib : 3.8.4
mapclassify: None
fiona : 1.9.5
psycopg : None
psycopg2 : None
pyarrow : None

@martinfleis
Member

martinfleis commented Jul 8, 2024

Thanks for the report! I can confirm that with the new default IO engine pyogrio, this indeed returns a string.

A workaround is to use the old engine, which was the default before 1.0.

uhslc_gpd = gpd.read_file("https://uhslc.soest.hawaii.edu/data/meta.geojson", engine="fiona")

@brendan-ward will know more whether this is expected or something we need to process differently in pyogrio.

@veenstrajelmer
Author

@martinfleis thanks a lot for this useful suggestion, this conveniently solves the issue I had at least on my side. However, the engine string seems to be case sensitive, so it should be engine='fiona'.

@martinfleis
Member

I'll keep this open and move it to pyogrio as we may want to look into that there.

martinfleis reopened this Jul 8, 2024
martinfleis transferred this issue from geopandas/geopandas Jul 8, 2024
@brendan-ward
Member

It looks like there is a field type OFSTJSON that Fiona is using in this case to automatically convert to dict, and on write, automatically convert dict / list values when serializing.

On the Pyogrio side, we need to detect this subtype and carry that info through when deserializing / serializing fields. Serializing is likely to be harder because the numpy array dtype does not give us this info, so there may be a real performance penalty there (or we leave this as the responsibility of the user).

For now, you could also manually parse applicable fields to dict and still get the speedups of Pyogrio:

import json

import geopandas as gpd

uhslc_gpd = gpd.read_file("https://uhslc.soest.hawaii.edu/data/meta.geojson")
# Decode the JSON-encoded strings in this column back into dicts
uhslc_gpd["rq_span"] = uhslc_gpd.rq_span.apply(json.loads)

@veenstrajelmer
Author

Thanks for the suggestion. That would also work indeed, but "rq_span" is not the only field that requires conversion, so for my application I prefer the fiona approach for now.
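
When several fields need conversion, the manual approach can be generalized. The following is a minimal sketch, not part of geopandas or pyogrio: the helper names parse_json_value and parse_json_columns are hypothetical, and the sketch assumes that any string starting with "{" or "[" is meant to be JSON. Values that are not strings, or that fail to decode, are returned unchanged.

```python
import json


def parse_json_value(value):
    """Decode a JSON object/array string; return anything else unchanged."""
    if isinstance(value, str) and value.lstrip()[:1] in ("{", "["):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            # Looks like JSON but is not valid: keep the original string
            return value
    return value


def parse_json_columns(frame, columns):
    """Apply parse_json_value to the named columns of a (Geo)DataFrame."""
    for column in columns:
        frame[column] = frame[column].apply(parse_json_value)
    return frame
```

With the file from this issue, usage might look like uhslc_gpd = parse_json_columns(uhslc_gpd, ["fd_span", "rq_span"]), listing whichever columns hold nested objects.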

@veenstrajelmer
Author

FYI, engine="fiona" fails for fiona>=1.10.0 as documented in Toblerity/Fiona#1451. I hope this or that issue will soon be resolved so I can also read this geojson (including dict parsing) with the newest versions again.
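
Until one of the two issues is resolved, the engine could be chosen from the installed Fiona version. The following is a rough sketch under stated assumptions: the 1.10.0 threshold is taken from the comment above, the helper names fiona_engine_usable and installed_fiona_version are hypothetical, and a missing or unparseable version is treated as "do not use Fiona".

```python
import re
from importlib import metadata


def fiona_engine_usable(version):
    """Return True if the version string is older than 1.10 (assumed threshold)."""
    match = re.match(r"(\d+)\.(\d+)", version or "")
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 10)


def installed_fiona_version():
    """Return the installed Fiona version string, or None if it is not installed."""
    try:
        return metadata.version("fiona")
    except metadata.PackageNotFoundError:
        return None


# Prefer the old engine only where it is expected to work
engine = "fiona" if fiona_engine_usable(installed_fiona_version()) else "pyogrio"
```

The resulting value can be passed as gpd.read_file(..., engine=engine). When it falls back to "pyogrio", the nested fields still arrive as strings and need to be decoded with json.loads afterwards.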
