Identifying old entities in the brownfield-land dataset #355
greg-slater started this conversation in Ideas and feedback
Thanks for this, Greg! With regard to calculating or inferring end dates for old entities, is one option to present the ‘anomaly’ back to the data provider so that they can resolve it in the source data? They might choose to add an end date for an entity, or they could tell us to stop collecting from an endpoint (allowing us to add an end date).
Problem description
There is a large number of brownfield-land entities whose references originally came from an early resource but no longer appear in more recent resources or endpoints for the same provision. This suggests that the provider has removed the record rather than using the end-date field. We’d like to make it clearer that these are old records, so we are proposing a method to identify them and add our own calculated or inferred end-date to the affected entities.
We think that of the ~35K brownfield-land entities, almost 11K have references that don’t appear in the latest resource for a provision and should therefore possibly be given an end-date.
Example
Blackburn with Darwen Borough Council have supplied us with 3 working endpoints since 2018, two of which are still active (i.e. don't have an end-date).
If we check the reference facts for entity 1700658 we can see that the latest fact for the reference field comes from the oldest resource, meaning the reference hasn't appeared in any subsequent resource.
Proposed method
Because we keep multiple endpoints per provision active for brownfield-land, we can't just take the latest resource from the only active endpoint as the latest update. So for each provision we propose to use the list of reference values from the latest resource from the most recently added endpoint as the “up-to-date” list. Any entities from the provision with references that only appear on older resources (and which don’t already have an end-date) will be given an end-date taken from one of two places: the resource_end_date of the last resource the reference appeared on or, when that resource is still active, the endpoint_entry_date of the following endpoint.
This datasette query shows how the Blackburn with Darwen resources from the example above would be sorted using the endpoint_entry_date and resource_start_date, and how the entity end date could be worked out from either the resource_end_date or the endpoint_entry_date of the following endpoint when a resource is still active.
Once we work out the end-dates required for any entities we can host a list of entities and end-dates ourselves and add it as a new source & endpoint.
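The method described above could be sketched roughly as follows. This is only an illustration of the logic for a single provision, not the actual pipeline code: the Resource class, the infer_end_dates function and the sample endpoint names, references and dates are all made up for the example, and only the three date fields mirror the columns named in the query.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Resource:
    # Hypothetical record for one resource collected from one endpoint.
    endpoint: str
    endpoint_entry_date: date
    resource_start_date: date
    resource_end_date: Optional[date]  # None while the resource is still active
    references: frozenset


def infer_end_dates(resources, already_ended=frozenset()):
    """Return {reference: inferred end-date} for one provision."""
    # Sort by endpoint_entry_date, then resource_start_date.
    ordered = sorted(
        resources, key=lambda r: (r.endpoint_entry_date, r.resource_start_date)
    )
    if not ordered:
        return {}
    # Latest resource from the most recently added endpoint: the "up-to-date" list.
    current = ordered[-1].references

    # Last resource (in sort order) on which each reference appears.
    last_seen = {}
    for res in ordered:
        for ref in res.references:
            last_seen[ref] = res

    inferred = {}
    for ref, res in last_seen.items():
        if ref in current or ref in already_ended:
            continue  # still current, or the provider already supplied an end-date
        end = res.resource_end_date
        if end is None:
            # Resource still active: use the entry date of the following endpoint.
            later = [
                r.endpoint_entry_date
                for r in ordered
                if r.endpoint_entry_date > res.endpoint_entry_date
            ]
            end = min(later) if later else None
        if end is not None:
            inferred[ref] = end
    return inferred


# Made-up data: three endpoints, the two most recent still active.
resources = [
    Resource("endpoint-a", date(2018, 1, 1), date(2018, 1, 1), date(2019, 6, 1),
             frozenset({"BF1", "BF2"})),
    Resource("endpoint-b", date(2020, 1, 1), date(2020, 1, 1), None,
             frozenset({"BF2", "BF3"})),
    Resource("endpoint-c", date(2022, 1, 1), date(2022, 1, 1), None,
             frozenset({"BF3"})),
]
end_dates = infer_end_dates(resources)
```

In this made-up data BF1 last appears on a resource that has ended, so it takes that resource's end date, while BF2 last appears on a still-active resource, so it takes the entry date of the following endpoint. BF3 is on the latest resource and gets no end-date.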
NOTE
Handling future updates
The rationale for keeping historic BFL endpoints active is so that providers can make retrospective edits if necessary. This may affect our calculated end-dates in a few ways:
Provider adds end-date values to historic records which we have already calculated an end-date for. In this case the new fact should supersede the one we originally worked out as the entry-date will be more recent.
Provider removes records from a historic endpoint. This would mean our calculated end-date should be updated (from the date that a newer endpoint was added, to the date that the last resource ended). Designing a way to maintain this programmatically may be challenging. It could be worth working out how often this happens before designing a process.
Provider adds records to a historic endpoint that are not on a newer endpoint. Following our normal process we’d assign a new entity for the record, but it should then also be given an end-date by this new process as it doesn’t exist on the latest resource.
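The first case above relies on the most recent entry-date winning. A minimal sketch of that rule, assuming each end-date fact is a dictionary with a value and an entry_date (a made-up shape for illustration, not the real fact schema):

```python
from datetime import date


def latest_end_date(facts):
    """Return the end-date value from the fact with the most recent entry-date."""
    if not facts:
        return None
    return max(facts, key=lambda f: f["entry_date"])["value"]


# Made-up facts: our inferred end-date, then a later one supplied by the provider.
facts = [
    {"value": "2022-01-01", "entry_date": date(2024, 3, 1), "source": "inferred"},
    {"value": "2021-09-30", "entry_date": date(2024, 6, 15), "source": "provider"},
]
chosen = latest_end_date(facts)
```

Here the provider's fact has the later entry-date, so its value replaces the inferred one without any special handling.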