-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
66 scotland features #70
base: dev
Are you sure you want to change the base?
Conversation
…use points instead of boundaries. this is due to boundaries dataset not being complete
…ll as England and Wales - fix bug in Wales data processing. Wales listed buildings dataset is shared as point geometries, not polygons so update from sjoin to sjoin_nearest - refactor the EPC chunking function slightly to be more succinct
…e nearest neighbour distance threshold
- add WHS dataset to config base.yaml - rename conservation_areas.py to protected_areas.py - add functions to protected_areas.py to transform WHS data - update cons area functions in protected_areas.py to return properties in cons areas only - update listed_buildings.py functions to return properties in listed buildings only - update geospatial functions that join data to EPC to use epc geodataframe where possible instead of generating new gdf every time - changes to listed_buildings.py; output_areas.py; protected_areas.py - update run_add_features.py script to use new WHS functions and to use EPC gdf instead of EPC df
- add INSPIRE scotland data source to S3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @crispy-wonton - this looks great :) ! The joins all made sense to me, I just had a few comments.
I wonder whether the threshold of max_distance = 5 might be too harsh - but I can't remember your exact reasoning for it? I did a very brief analysis at the end of your notebook - and increasing it to 10 would allow many more of the correct matches to be outputted as a match (recall from 0.77 to 0.9), without effecting the precision too much (1 to 0.96). What do you think?
df = epc_gdf.sjoin_nearest( | ||
listed_buildings_gdf, how="inner", max_distance=distance | ||
)[["UPRN", "listed_building"]].drop_duplicates(subset="UPRN") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks right.
Thinking aloud: So you find all the points or geometries of listed properties within a certain distance of each UPRN, then drop duplicates (since you don't care about which listed property they match to, just that they match to at least one).
elif any( | ||
[ | ||
expr in listed_buildings_gdf.geom_type.unique() | ||
for expr in ["Polygon", "MultiPolygon"] | ||
] | ||
): | ||
df = epc_gdf.sjoin(listed_buildings_gdf, how="inner", predicate="intersects")[ | ||
["UPRN", "listed_building"] | ||
].drop_duplicates(subset="UPRN") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume sjoin
can be used for both points and polygons, so why use sjoin_nearest
at all?
|
||
# Replace `lad_code` from postcode with `lad_code` from geospatial join and postcode | ||
logging.info("Adding LAD code with geospatial join") | ||
uprn_lad_df = output_areas.sjoin_df_uprn_lad_code(epc_df) | ||
uprn_lad_df = output_areas.sjoin_df_uprn_lad_code(epc_gdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked through all the joins in this script + the functions that are referenced. The joins make sense to me! At first I was wondering why you'd not just left joined everything to epc_gdf
(rather than inner join to epc_gdf
and then left join of result to epc_df
), but I think the way youve done it is more clear and efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think (?) it has to be an sjoin_nearest
because with sjoin
, these are the predicates for joining:
https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#binary-predicate-joins
- intersects
- contains
- within
- touches
- crosses
- overlaps
So points that aren't exactly touching/ on top of each other will not get joined together. This was the case with the Wales listed buildings sjoin that was in the code before. It does join some of the buildings, but you get much more with thesjoin_nearest
methodology. Does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeh makes sense! Thanks for a explanation :)
Fixes #66
Description
Add datasets for Scottish features (reweighting features added in a separate branch).
Added so far:
protected_areas.py
listed_buildings.py
run_download_inspire.py
- new Script to download INSPIRE land registry files for Scotland20240926_explore_scotland_listed_buildings.py
- new Notebook to identify threshold distance for assigning nearest listed building neighbour to UPRNNB: I need to add the new datasets to the README.
Instructions for Reviewer
Please could you pay special attention to:
Checklist:
notebooks/
pre-commit
and addressed any issues not automatically fixeddev
README
s