Index Welsh "HMRC Contact" specialist documents #3226

Merged: 1 commit, Mar 31, 2025

@ChrisBAshton ChrisBAshton commented Mar 28, 2025

We're migrating the HMRC Contacts Admin over to a general "HMRC Finder" Specialist Finder. HMRC need to be able to publish both English and Welsh specialist documents (of type hmrc_contact) to serve user needs.

Whilst the specialist document itself renders well with a locale of cy (Welsh page furniture and correct lang attribute set on the <main> HTML element), Welsh specialist documents aren't currently surfaced in the HMRC Finder because Search API rejects all non-English documents at the point of indexing. See related PR here:
#1810

This means that the "Cymraeg / Welsh" facet we've added to the HMRC Finder (for feature parity with the Contacts Admin) doesn't work: it always returns 0 documents.
alphagov/specialist-publisher#3011

Non-English documents were omitted from Search API because:

> there's no way to filter by language
> our stemming and synonyms are only set up for English
> We have non-English content in Rummager and don't know what language it is as Rummager does not store it
> We show Welsh results on search pages
> We may inadvertently be showing some non-English content in feeds / English pages

These concerns matter much less in 2025, because site search uses search-api-v2, which is unaffected by this change. That said, site search falls back to search-api (v1) if no query param is provided, so anyone looking at https://www.gov.uk/search/all?order=updated-newest could see lots of non-English content surfacing if we're not careful about how we make this change.

We're therefore scoping this change to just Welsh, and just the HMRC Contact specialist document type. There will only be a few of these documents, so the chances of them being unexpectedly surfaced outside of the HMRC Finder itself are very slim.
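
As a rough illustration of that scoping (not the code in this PR), the rule could look something like the sketch below. The method name `indexable?`, the constant `WELSH_ALLOWED_DOCUMENT_TYPES` and the treatment of a missing locale as English are all assumptions made for the example.

```ruby
# Hypothetical sketch: these names are illustrative and are not taken from
# the Search API codebase.
WELSH_ALLOWED_DOCUMENT_TYPES = %w[hmrc_contact].freeze

# Returns true if a document with this locale and type should be indexed.
def indexable?(locale:, document_type:)
  # Assumption: a missing locale is treated as English.
  return true if locale.nil? || locale == "en"

  # Welsh is accepted only for the explicitly allowed document types;
  # every other non-English locale is still rejected.
  locale == "cy" && WELSH_ALLOWED_DOCUMENT_TYPES.include?(document_type)
end
```

Under these assumptions, `indexable?(locale: "cy", document_type: "hmrc_contact")` is true, while a Welsh document of any other type, or a document in any other non-English locale, is still rejected.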

A better solution would be to start indexing the locale of all documents, and then apply a locale filter everywhere our frontend apps make calls to Search API. But that's a pretty sizeable change, especially considering that we generally want to be moving to Search API v2. So this was considered a suitable stop-gap in the meantime. We've logged the wider issue as a publishing tech debt card to revisit later: https://trello.com/c/ZzszTweH/
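
For a sense of what that wider change might involve on the calling side, here is a minimal sketch. It assumes a `locale` field has been indexed and is filterable via a `filter_locale` parameter. Neither exists today, and the helper name `with_locale_filter` is made up for the example.

```ruby
# Hypothetical sketch: "filter_locale" is an assumed parameter that would only
# work once locale is indexed for all documents. It is not an existing
# Search API option.
def with_locale_filter(params, locale = "en")
  # Hash#merge returns a new hash, so the caller's params are not mutated.
  params.merge("filter_locale" => locale)
end

english_only = with_locale_filter({ "order" => "-public_timestamp" })
welsh_only = with_locale_filter({ "filter_format" => "hmrc_contact" }, "cy")
```

Every frontend call to Search API would need to go through something like this, which is part of why the change is sizeable.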

Trello for this work: https://trello.com/c/fXxLTdGk/

@ChrisBAshton ChrisBAshton force-pushed the index-welsh-hmrc-contacts branch from 4ad87f2 to 086eba0 Compare March 28, 2025 15:21
@ChrisBAshton ChrisBAshton merged commit c16199a into main Mar 31, 2025
6 checks passed
@ChrisBAshton ChrisBAshton deleted the index-welsh-hmrc-contacts branch March 31, 2025 10:34