Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

66 scotland features #70

Draft
wants to merge 15 commits into
base: dev
Choose a base branch
from
Draft

66 scotland features #70

wants to merge 15 commits into from

Conversation

crispy-wonton
Copy link
Collaborator

@crispy-wonton crispy-wonton commented Sep 26, 2024

Fixes #66

Description

Add datasets for Scottish features (reweighting features added in a separate branch).

Added so far:

  • World Heritage Site status for Scotland: see changes to protected_areas.py
  • listed buildings in Scotland - listed_buildings.py
  • run_download_inspire.py - new Script to download INSPIRE land registry files for Scotland
  • 20240926_explore_scotland_listed_buildings.py - new Notebook to identify threshold distance for assigning nearest listed building neighbour to UPRN

NB: I need to add the new datasets to the README.

Instructions for Reviewer

Please could you pay special attention to:

  • the join logic for listed buildings, conservation areas, and world heritage sites
  • the notebook analysis to check the threshold has been calculated and selected appropriately
  • review the nearest neighbour join logic

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

…use points instead of boundaries.

this is due to boundaries dataset not being complete
…ll as England and Wales

- fix bug in Wales data processing. Wales listed buildings dataset is shared as point geometries, not polygons so update from sjoin to sjoin_nearest
- refactor the EPC chunking function slightly to be more succinct
- add WHS dataset to config base.yaml
- rename conservation_areas.py to protected_areas.py
- add functions to protected_areas.py to transform WHS data
- update cons area functions in protected_areas.py to return properties in cons areas only
- update listed_buildings.py functions to return properties in listed buildings only
- update geospatial functions that join data to EPC to use epc geodataframe where possible instead of generating new gdf every time - changes to listed_buildings.py; output_areas.py; protected_areas.py
- update run_add_features.py script to use new WHS functions and to use EPC gdf instead of EPC df
Copy link
Collaborator

@lizgzil lizgzil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @crispy-wonton - this looks great :) ! The joins all made sense to me, I just had a few comments.

I wonder whether the threshold of max_distance = 5 might be too harsh - but I can't remember your exact reasoning for it? I did a very brief analysis at the end of your notebook - and increasing it to 10 would allow many more of the correct matches to be outputted as a match (recall from 0.77 to 0.9), without effecting the precision too much (1 to 0.96). What do you think?

Screenshot 2024-10-11 at 15 48 04

Comment on lines +107 to +109
df = epc_gdf.sjoin_nearest(
listed_buildings_gdf, how="inner", max_distance=distance
)[["UPRN", "listed_building"]].drop_duplicates(subset="UPRN")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks right.

Thinking aloud: So you find all the points or geometries of listed properties within a certain distance of each UPRN, then drop duplicates (since you don't care about which listed property they match to, just that they match to at least one).

Comment on lines +110 to +118
elif any(
[
expr in listed_buildings_gdf.geom_type.unique()
for expr in ["Polygon", "MultiPolygon"]
]
):
df = epc_gdf.sjoin(listed_buildings_gdf, how="inner", predicate="intersects")[
["UPRN", "listed_building"]
].drop_duplicates(subset="UPRN")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume sjoin can be used for both points and polygons, so why use sjoin_nearest at all?


# Replace `lad_code` from postcode with `lad_code` from geospatial join and postcode
logging.info("Adding LAD code with geospatial join")
uprn_lad_df = output_areas.sjoin_df_uprn_lad_code(epc_df)
uprn_lad_df = output_areas.sjoin_df_uprn_lad_code(epc_gdf)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked through all the joins in this script + the functions that are referenced. The joins make sense to me! At first I was wondering why you'd not just left joined everything to epc_gdf (rather than inner join to epc_gdf and then left join of result to epc_df), but I think the way youve done it is more clear and efficient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (?) it has to be an sjoin_nearest because with sjoin, these are the predicates for joining:
https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#binary-predicate-joins

  • intersects
  • contains
  • within
  • touches
  • crosses
  • overlaps
    So points that aren't exactly touching/ on top of each other will not get joined together. This was the case with the Wales listed buildings sjoin that was in the code before. It does join some of the buildings, but you get much more with the sjoin_nearest methodology. Does this make sense?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh makes sense! Thanks for a explanation :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add features for Scotland
2 participants