
feat(search): Adds logic to download search results #4893

Merged 38 commits into main on Mar 11, 2025

Conversation

ERosendo
Contributor

@ERosendo ERosendo commented Jan 6, 2025

This PR implements the backend logic for exporting search results (#599).

Key changes:

  • Introduces a new rate limiter to throttle CSV export requests to 5 per day.

  • Adds a new setting named MAX_SEARCH_RESULTS_EXPORTED (default: 250) to control the maximum number of rows included in the generated CSV file.

  • Refactors the view.py file within the search module. Helper functions related to fetching Elasticsearch results have been moved to the search_utils.py file for better organization and clarity.

  • Introduces two new helper functions: `fetch_es_results_for_csv` and `get_headers_for_search_export`.

  • Adds a new task that takes the user_id and the query string as input. It then sends an email with a CSV file containing at most MAX_SEARCH_RESULTS_EXPORTED rows.
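The in-memory CSV step described above can be sketched roughly as follows. This is a simplified illustration, not the actual CourtListener code: `build_csv_for_export`, its arguments, and the inlined `MAX_SEARCH_RESULTS_EXPORTED` constant are hypothetical stand-ins for the real task, which also handles rate limiting and emails the file.

```python
import csv
import io

# Hypothetical cap mirroring the MAX_SEARCH_RESULTS_EXPORTED setting (default: 250).
MAX_SEARCH_RESULTS_EXPORTED = 250


def build_csv_for_export(rows: list[dict], headers: list[str]) -> str:
    """Render at most MAX_SEARCH_RESULTS_EXPORTED result rows as an
    in-memory CSV string, with one column per header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, extrasaction="ignore")
    writer.writeheader()
    for row in rows[:MAX_SEARCH_RESULTS_EXPORTED]:
        writer.writerow(row)
    return buffer.getvalue()
```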

@freelawproject freelawproject deleted a comment from semgrep-app bot Jan 6, 2025
@ERosendo ERosendo force-pushed the 599-feat-add-logic-to-download-results branch 2 times, most recently from 535f37a to e8f6fd3 Compare January 13, 2025 18:31
@ERosendo ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from b3ba8d5 to 8721c41 Compare January 16, 2025 02:32

semgrep-app bot commented Jan 16, 2025

Semgrep found 3 avoid-pickle findings:

Avoid using pickle, which is known to lead to code execution vulnerabilities. When unpickling, the serialized data could be manipulated to run arbitrary code. Instead, consider serializing the relevant data as JSON or a similar text-based serialization format.


Semgrep found 1 direct-use-of-jinja2 finding:

Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS.


@ERosendo ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from 8721c41 to ae29bba Compare January 16, 2025 04:07
This commit refactors the search module by moving helper functions from `view.py` to `search_utils.py`. This improves code organization and makes these helper functions reusable across different modules.
@ERosendo ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from ae29bba to 92cddf5 Compare January 16, 2025 04:19
@ERosendo ERosendo marked this pull request as ready for review January 16, 2025 05:59
@ERosendo ERosendo requested a review from mlissner January 16, 2025 05:59
Member

@mlissner mlissner left a comment


I gave this a once-over and it feels about right. Concerns I'll highlight for you guys to consider:

  1. Memory: We're putting the CSV in memory, which sure is handy. I think this is fine b/c it'll be pretty small, a couple hundred KB, right? This must be fine, but it's on my mind.

  2. The fields in the result might be annoying with columns that aren't normalized to human values (like SOURCE: CR or something, and local_path: /recap/gov.xxxx.pdf instead of https://storage.courtlistener.com/recap/gov.xxx.pdf). I didn't see code to fix that, but it's probably something we should do if we can. This CSV is supposed to be for humans, in theory.
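A minimal sketch of the kind of value normalization point 2 asks for. Everything here is an illustrative assumption rather than code from this PR: the `SOURCE_LABELS` mapping is invented, and `humanize_row` is a hypothetical helper name; only the storage base URL comes from the comment above.

```python
# Hypothetical code -> human-label map; the real source codes and labels
# would come from the project's own constants.
SOURCE_LABELS = {"CR": "Court website and RECAP"}
STORAGE_BASE_URL = "https://storage.courtlistener.com/"


def humanize_row(row: dict) -> dict:
    """Return a copy of a result row with machine values replaced by
    human-readable ones (source labels, absolute storage URLs)."""
    out = dict(row)
    if out.get("source") in SOURCE_LABELS:
        out["source"] = SOURCE_LABELS[out["source"]]
    if out.get("local_path"):
        out["local_path"] = STORAGE_BASE_URL + out["local_path"].lstrip("/")
    return out
```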

I appreciate the refactor, but I'd suggest it in a separate PR in the future, so it's not mixed in.

But this looks about right to me otherwise. :)

@mlissner mlissner assigned albertisfu and unassigned mlissner Jan 17, 2025
Contributor

@albertisfu albertisfu left a comment


@ERosendo this looks good and is on the right track. I’ve left some comments and suggestions in the code, along with additional feedback here:

  • In addition to Mike's comment about normalizing values for humans: I noticed that the CSV headers don't maintain a fixed order when the CSV is generated, which made it difficult to determine when results belong to the same "Case," particularly when matching child documents. It might be a good idea to fix the headers for each search type and to prioritize key headers that help identify whether results belong to the same case. For instance, in RECAP the headers could start like this:

    docket_id, docket_number, pacer_case_id, court_exact, case_name, document_number, attachment_number, ...

  • Highlighted fields in the results are always represented as a list of terms, even though we are only highlighting a single fragment. Most of the HL fields are not naturally lists of terms. However, some fields, such as citations in case law, can be lists and are also highlighted.

    Currently, these fields are rendered as lists:
    [Screenshot: highlighted fields rendered as lists in the CSV output]

    As in the frontend, perhaps you could use the `render_string_or_list` filter, or a modified version of it, to render HL fields as strings instead of lists when they are not multi-fields?

  • Regarding Judge Search, I noticed that the CSV only contains fields from PersonDocument, and some fields currently rendered as flat fields in the frontend (such as "Appointers" and other similar fields extracted from the database in merge_unavailable_fields_on_parent_document) are not included. Is this behavior expected?

  • I'd recommend adding at least one integration test to help catch and prevent regressions in the future related to the suggestions and bugs mentioned in the comments.
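The fixed-header idea above could be sketched like this. `PRIORITY_HEADERS` and `ordered_headers` are invented names for illustration, not the implementation in this PR; the RECAP priority list is taken from the example above.

```python
# Per-search-type priority order; any remaining keys are appended
# alphabetically so the layout stays stable between exports.
PRIORITY_HEADERS = {
    "recap": [
        "docket_id", "docket_number", "pacer_case_id", "court_exact",
        "case_name", "document_number", "attachment_number",
    ],
}


def ordered_headers(search_type: str, keys: set[str]) -> list[str]:
    """Return CSV headers with key identifiers first, in a fixed order."""
    priority = PRIORITY_HEADERS.get(search_type, [])
    head = [k for k in priority if k in keys]
    tail = sorted(keys - set(priority))
    return head + tail
```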

Thank you!

@albertisfu albertisfu assigned ERosendo and unassigned albertisfu Jan 21, 2025
Introduces `is_csv_export` to `do_es_search`, allowing retrieval of all results up to `MAX_SEARCH_RESULTS_EXPORTED` for CSV exports, bypassing the `RECAP_SEARCH_PAGE_SIZE` limit. Refactors `fetch_es_results_for_csv` to remove the redundant while loop.
@ERosendo
Contributor Author

ERosendo commented Mar 5, 2025

@albertisfu Thanks for your comments. This PR is ready for another review.

Here's a summary of the changes implemented to address your comments.

  • To address your comment regarding the CSV header order, I've implemented a class method (`get_csv_headers`) that returns a predefined list of keys to use as headers.

  • While addressing your suggestion regarding the CSV header order, I found that the ES document classes use a mix of camel case (e.g., caseName) and snake case (e.g., absolute_url) for attribute names. To improve the clarity and consistency of the CSV for users, I've introduced a helper in commit 5f469ca that converts all keys to snake case.

  • As you pointed out, our logic to flatten results was overwriting parent information with child data, which was problematic for highlighted fields. I've resolved this by changing the merge logic. Previously, parent data was the base, and child data was added on top. Now, child data is the base, and parent data is added. This ensures highlights in parent documents are preserved, as they don't conflict with child fields.

  • I also added a class method that returns a dictionary with basic data transformations. We'll use this to prepare search results for the CSV file, making the data human-readable and rendering highlighted fields with the render_string_or_list helper.

  • I've updated the do_es_search helper method to add the argument you suggested and got rid of the while loop that was working around the recap search result limit. I also checked if we could use that argument to skip calculating counts, but it would mean changing two other methods. It looks like those ES methods are pretty connected. Maybe we could think about separating them a bit, and then do a separate refactoring to avoid calculating counts later.
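The camelCase-to-snake_case conversion mentioned above (commit 5f469ca) might look something like this sketch; the function name and regex are illustrative, not the committed helper.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase ES field name (e.g. caseName -> case_name) to
    snake_case; names already in snake_case pass through unchanged."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()
```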

Let me know what you think.

@ERosendo ERosendo requested a review from albertisfu March 5, 2025 04:00
@ERosendo ERosendo assigned albertisfu and unassigned ERosendo Mar 5, 2025
Contributor

@albertisfu albertisfu left a comment


Thanks, @ERosendo! This looks good! I just have a few additional suggestions.
One thing I noticed is that rendering dockets with no documents still needs a couple of tweaks, as detailed in the code comments.

@albertisfu albertisfu assigned ERosendo and unassigned albertisfu Mar 7, 2025
@ERosendo ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from 50fbe52 to 3f324c6 Compare March 11, 2025 16:00
@ERosendo ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from 2a84a76 to 503453d Compare March 11, 2025 17:05
@ERosendo ERosendo requested a review from albertisfu March 11, 2025 17:06
@ERosendo ERosendo assigned albertisfu and unassigned ERosendo Mar 11, 2025
Contributor

@albertisfu albertisfu left a comment


Thanks @ERosendo, this looks great now and is ready to go!

Just a note from our conversation: When we work on adding the button to the UI for requesting the results to be downloaded, we should implement a validation to ensure the button is only shown for valid searches that return results. This will help avoid triggering the task unnecessarily and retrying the task on queries with syntax errors.

Set to auto-merge!

@albertisfu albertisfu merged commit c22c494 into main Mar 11, 2025
15 checks passed
@albertisfu albertisfu deleted the 599-feat-add-logic-to-download-results branch March 11, 2025 22:39
@mlissner
Member

Woo, congrats guys on getting this one done. It was harder than it seemed!
