feat(search): Adds logic to download search results #4893
Conversation
Force-pushed from 535f37a to e8f6fd3.
Force-pushed from b3ba8d5 to 8721c41.
Semgrep found 3 findings from the `avoid-pickle` rule (avoid using pickle). Semgrep found 1 finding: Detected direct use of jinja2. If not done properly, this may bypass HTML escaping, which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS.
Force-pushed from 8721c41 to ae29bba.
This commit refactors the search module by moving helper functions from `view.py` to `search_utils.py`. This improves code organization and makes these helper functions reusable across different modules.
Force-pushed from ae29bba to 92cddf5.
I gave this a once-over and it feels about right. Concerns I'll highlight for you guys to consider:
- Memory: We're putting the CSV in memory, which sure is handy. I think this is fine b/c it'll be pretty small, a couple hundred KB, right? This must be fine, but it's on my mind.
- The fields in the result might be annoying, with columns that aren't normalized to human values (like SOURCE: CR or something, and local_path: /recap/gov.xxxx.pdf instead of https://storage.courtlistener.com/recap/gov.xxx.pdf). I didn't see code to fix that, but it's probably something we should do if we can. This CSV is supposed to be for humans, in theory.
I appreciate the refactor, but I'd suggest it in a separate PR in the future, so it's not mixed in.
But this looks about right to me otherwise. :)
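To make the memory point above concrete, here is a minimal sketch of building the CSV entirely in memory with the standard library (the field names and helper name are hypothetical, not the PR's actual code):

```python
import csv
import io

def results_to_csv_bytes(rows, headers):
    """Build the whole CSV in memory; fine while exports stay small
    (a few hundred rows is well under a megabyte)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

rows = [
    {"docket_id": 1, "case_name": "Lorem v. Ipsum"},
    {"docket_id": 1, "case_name": "Lorem v. Ipsum", "document_number": 2},
]
data = results_to_csv_bytes(rows, ["docket_id", "case_name", "document_number"])
```

With a 250-row cap this stays in the tens-of-KB range, consistent with the "couple hundred KB" estimate above.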
@ERosendo this looks good and is on the right track. I’ve left some comments and suggestions in the code, along with additional feedback here:
- In addition to Mike's comment about normalizing values for humans: I noticed that the CSV headers don't maintain a fixed order when the CSV is generated, and I found it difficult to determine when results belong to the same "Case," particularly when matching child documents. It might be a good idea to ensure that the headers are fixed for each search type and to prioritize key headers that help identify whether results belong to the same case. For instance, in RECAP the headers could start like this:
  docket_id, docket_number, pacer_case_id, court_exact, case_name, document_number, attachment_number, ...
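A minimal sketch of that fixed-order idea (header names are taken from the example above; the helper name is hypothetical):

```python
# Fixed, key-first header order for RECAP exports, so consecutive rows
# from the same case are easy to group by docket_id / docket_number.
RECAP_CSV_HEADERS = [
    "docket_id", "docket_number", "pacer_case_id", "court_exact",
    "case_name", "document_number", "attachment_number",
]

def ordered_row(result, headers):
    """Emit values in the fixed header order, blank when a field is missing."""
    return [result.get(field, "") for field in headers]

row = ordered_row({"docket_id": 42, "case_name": "A v. B"}, RECAP_CSV_HEADERS)
```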
- Highlighted fields in the results are always represented as a list of terms, even though we only highlight a single fragment. Most of the HL fields are not naturally lists of terms; however, some fields, such as citations in case law, can be lists and are also highlighted. As in the frontend, perhaps you could use the `render_string_or_list` filter (or a modified version of it) to render HL fields as strings instead of lists when they are not multi-fields?
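The frontend filter mentioned above could be approximated like this (a sketch of the idea, not the project's actual `render_string_or_list` implementation):

```python
def render_string_or_list(value, separator=", "):
    """Unwrap single-fragment highlight lists into plain strings;
    join genuine multi-value fields (e.g. citations) with a separator."""
    if isinstance(value, (list, tuple)):
        if len(value) == 1:
            return value[0]
        return separator.join(str(item) for item in value)
    return value
```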
- Regarding Judge Search, I noticed that the CSV only contains fields from `PersonDocument`, and some fields currently rendered as flat fields in the frontend (such as "Appointers" and other similar fields extracted from the database in `merge_unavailable_fields_on_parent_document`) are not included. Is this behavior expected?
- I'd recommend adding at least one integration test to help catch and prevent future regressions related to the suggestions and bugs mentioned in the comments.
Thank you!
This method returns a predefined, fixed-order list of header strings for generating CSV files from search results, ensuring consistent output.
Introduces `is_csv_export` to `do_es_search`, allowing retrieval of all results up to `MAX_SEARCH_RESULTS_EXPORTED` for CSV exports, bypassing the `RECAP_SEARCH_PAGE_SIZE` limit. Refactors `fetch_es_results_for_csv` to remove the redundant while loop.
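The control flow described in that commit message might reduce to something like this sketch (the setting names appear in the PR; the function and the page-size default here are assumptions):

```python
MAX_SEARCH_RESULTS_EXPORTED = 250  # setting added by this PR (default: 250)
RECAP_SEARCH_PAGE_SIZE = 20        # assumed value for the normal page size

def export_page_size(is_csv_export: bool) -> int:
    """CSV exports fetch everything up to the export cap in a single
    query, instead of paging with the normal search page size."""
    return MAX_SEARCH_RESULTS_EXPORTED if is_csv_export else RECAP_SEARCH_PAGE_SIZE
```

Fetching one capped page is also what makes the while loop in `fetch_es_results_for_csv` redundant.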
@albertisfu Thanks for your comments. This PR is ready for another review. Here's a summary of the changes implemented to address your comments.
Let me know what you think.
Thanks, @ERosendo! This looks good! I just have a few additional suggestions.
One thing I noticed is that rendering dockets with no documents still needs a couple of tweaks, as detailed in the inline comments.
Force-pushed from 50fbe52 to 3f324c6.
Force-pushed from 2a84a76 to 503453d.
Thanks, @ERosendo! This looks great now and is ready to go!
Just a note from our conversation: when we work on adding the UI button for requesting results to be downloaded, we should implement validation to ensure the button is only shown for valid searches that return results. This will help avoid triggering the task unnecessarily and retrying it on queries with syntax errors.
Set to auto-merge!
Woo, congrats guys on getting this one done. It was harder than it seemed!
This PR implements the backend logic for exporting search results (#599).
Key changes:
- Introduces a new rate limiter to throttle CSV export requests to 5 per day.
- Adds a new setting named `MAX_SEARCH_RESULTS_EXPORTED` (default: 250) to control the maximum number of rows included in the generated CSV file.
- Refactors the `view.py` file within the search module. Helper functions related to fetching Elasticsearch results have been moved to the `search_utils.py` file for better organization and clarity.
- Introduces two new helper functions: `fetch_es_results_for_csv` and `get_headers_for_search_export`.
- Adds a new task that takes the `user_id` and the `query` string as input. It then sends an email with a CSV file containing at most `MAX_SEARCH_RESULTS_EXPORTED` rows.
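As an illustration of the "5 per day" throttling described above, a minimal in-memory sliding-window limiter could look like this (the PR's actual rate limiter presumably uses the project's existing rate-limiting machinery; everything below is a sketch):

```python
import time
from collections import defaultdict

EXPORTS_PER_DAY = 5
WINDOW_SECONDS = 24 * 60 * 60

_attempts = defaultdict(list)  # user_id -> timestamps of recent exports

def allow_export(user_id, now=None):
    """Allow at most EXPORTS_PER_DAY exports per user in a sliding
    24-hour window; record the attempt only when it is allowed."""
    now = time.time() if now is None else now
    recent = [t for t in _attempts[user_id] if now - t < WINDOW_SECONDS]
    allowed = len(recent) < EXPORTS_PER_DAY
    if allowed:
        recent.append(now)
    _attempts[user_id] = recent
    return allowed
```

A production version would need shared storage (e.g. the cache or database) rather than process-local state, which is why projects typically lean on an existing rate-limiting library.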