-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
work adding import functionality using django-import-export #800
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## develop #800 +/- ##
===========================================
+ Coverage 97.51% 97.54% +0.02%
===========================================
Files 229 229
Lines 12859 13012 +153
Branches 81 81
===========================================
+ Hits 12540 12692 +152
- Misses 316 317 +1
Partials 3 3 |
This is now working. It relies on the imported CSV having 3 columns: name, gender, nationality. But it appears to be fairly slow. Here is a sample of the output:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Configuration looks reasonable from what I can tell. Comments are good on the places you're customizing the import. I wonder if it would make more sense to do that in the spreadsheet instead of in code, but don't feel strongly about it.
dev-requirements.txt
Outdated
@@ -7,3 +7,4 @@ sphinx | |||
wheel | |||
pre-commit | |||
wagtail-factories | |||
django-import-export |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to go in the main requirements (dev reqs aren't installed on the servers)
mep/people/admin.py
Outdated
|
||
class PersonAdminWithImport(PersonAdmin, ImportExportModelAdmin): | ||
resource_class = PersonResource | ||
change_list_template = "templates/admin/people/person/change_list.html" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the default? do we need to customize for some reason?
model = Person | ||
fields = PERSON_IMPORT_COLUMNS | ||
import_id_fields = ('slug',) | ||
export_order = PERSON_IMPORT_COLUMNS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now export only exports the columns that are imported. We can change this (though there's annoyingly not a simple way to do this it appears in dj-imp-exp) or just document that for now. For instance if we wanted to annotate profession we could add that to both and write a little profession = Field(...ForeignKey(..))
in the code above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like an ok limitation since that wasn't part of the scope you were working on
(seems like a weird limitation on their part, but something we can live with)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got this working in latest, we export most everything we display in the Person table, and import only what we need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, I'm glad that django-import-export does what we need.
You should add unit tests for:
- `before_import_row
- import formatting methods for gender and nation (you'll need to adjust the nesting)
- get_import_fields
Person.save
optional behavior (skip viaf lookup)
It would be nice to have a couple of integration tests that run through the django-import-export code so that we can automatically test updates in future. I don't see anything in their docs about testing code that uses django-import-export, but one approach would be a view test with a django admin client, post a small import dataset and do some light testing to confirm it makes expected changes.
mep/people/admin.py
Outdated
|
||
def fmt_nation(x): | ||
x=str(x).strip() | ||
if not x or x=='[no country]': return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's an existing record for '[no country]'; was that not matching for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, whew—I thought this was our own annotation and that it had auto-created a [no country] where null or empty string was used.
model = Person | ||
fields = PERSON_IMPORT_COLUMNS | ||
import_id_fields = ('slug',) | ||
export_order = PERSON_IMPORT_COLUMNS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like an ok limitation since that wasn't part of the scope you were working on
(seems like a weird limitation on their part, but something we can live with)
mep/people/admin.py
Outdated
|
||
def after_import(self, *args, **kwargs): | ||
super().after_import(*args, **kwargs) | ||
# reconnect indexing signal handler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in addition to reconnecting the indexing signal handler, you need to actually index the records that were modified - is there a way to get a list of the person records from django-import-export?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The syntax is something like Person.index_items(people)
(where people
is an iterable of Person
records to index).
Here's an example of something similar:
mep-django/mep/people/models.py
Line 322 in 0452271
ModelIndexable.index_items(members) |
mep/people/models.py
Outdated
VIAF_SIGNAL_ACTIVE=True | ||
class ViafSignalHandler: | ||
def disconnect(): | ||
global VIAF_SIGNAL_ACTIVE | ||
VIAF_SIGNAL_ACTIVE = False | ||
def connect(): | ||
global VIAF_SIGNAL_ACTIVE | ||
VIAF_SIGNAL_ACTIVE = True | ||
def is_connected(): | ||
return VIAF_SIGNAL_ACTIVE | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signal handler connect/disconnect wording doesn't actually make sense for this since the viaf save method isn't triggered by a signal. If you can't come up with something better, you could just simplify this to a global variable that allows you to bypass the viaf logic in the person model save method. (Actually, using django settings would probably be better - more django-nic? - than a global variable).
I see that django-import-export has a bulk mode - that would simplify things for us because I think it would automatically bypass the viaf call on Person and would also not trigger the indexing signals. Is that not an option here because of the many-to-many nationality relationship?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that not an option here because of the many-to-many nationality relationship?
Yes, that's my sense from the docs and a quick test.
I'll look into alternatives re: global variable, etc. Right now the solution above is effectively a verbose global variable; let me see what else is doable without too much more time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for expediency let's just use django settings https://docs.djangoproject.com/en/5.0/topics/settings/
You can set a new setting, e.g. settings.SKIP_VIAF_LOOKUP
And then you can check if it's true but default to false if it's not set (use getattr
).
Backing up, though - are you importing VIAF ids for Person records that don't have birth and death date set? Don't we want those to be populated? Is there a bug in the logic that checks if the viaf lookup needs to run, when it's running when it shouldn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I'm only importing gender and nationality, I don't believe other fields were changed by our annotators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, interesting - did some quick investigation with my local data, I think we must have viaf set for a lot of authors where we don't have birth year and death year set - maybe because they aren't available in viaf. (I was using Person.objects.exclude(viaf_id="").filter(birth_year__isnull=True, death_year__isnull=True)
to look, but my data isn't very recent).
I wonder if the viaf logic is broken - some of these authors do seem like they should have birth/death years we can get from viaf, but the lookup isn't working. I also see that we're hitting the viaf API twice instead of once. 🙄
None of those are your problem here - bypassing for import still seems reasonable, but we might want to discuss with Josh - seems like author birth and death dates would be pretty useful information to have in the author demographics data we're going to be exporting. Maybe we can add a manage command to populate dates from viaf in bulk (and figure out why it isn't setting values properly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a little more digging - we have viaf IDs ending with slashes but the RDF has URIs ending without the slash, so our lookups don't match and we don't get the dates we should be getting. (Maybe something changed with VIAF since the time this was implemented, or maybe we changed our local practice.). We can fix that part in the S&co codebase.
I also see that the date parsing doesn't handle negative years... That needs to be fixed in viapy.
mep/people/admin.py
Outdated
x=str(x).strip() | ||
if not x or x=='[no country]': return | ||
return x | ||
row['nation'] = fmt_nation(row.get('nation')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the natural foreign key option get us anything here? (i.e., could we use that and get rid of the custom format could?)
It looks like we haven't implemented that for Country
, but it isn't hard to add and would be valuable for other things.
Thanks for the review! This all makes sense—hoping to finish this Fri. |
Okay, the indexing (of only updated objects) at end of import + including all listed-in-table export fields while importing only when we have logic for currently. I'll push tests soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the formatting I think you may not have pre-commit hooks installed; please install and make sure all of your new changes are blackened (etc) before we merge this.
Integration test is good for coverage but if something goes wrong a unit test would be more helpful to figure out where the issue is.
I think we can call it good once you add a test for the new behavior in the person save method. (And once the tests are all passing - do you know why they are not?)
mep/people/admin.py
Outdated
values into django-import-export lookup logic. | ||
""" | ||
# just make sure nation has no string padding | ||
row['nation'] = str(row.get('nation')).strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does django-import-export not handle this automatically?
I see "Strip whitespace when looking up ManyToMany fields (#668)" in the changelog for version 0.6 although not finding results elsewhere in the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good call, this should be done automatically
mep/people/admin.py
Outdated
""" | ||
Called when an instance either was or would be saved (depending on dry_run) | ||
""" | ||
self.objects_to_index.append(instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
surprised you need to collect these yourself, seems like something django-import-export should handle for you
can you use this 'store instance' option instead? https://django-import-export.readthedocs.io/en/latest/advanced_usage.html#access-full-instance-data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that. That gets us the instance object on after_import_row
and after_save_instance
, i.e. gives us the instance of a row as it's imported or saved, but does not store all the saved/updated/imported instances for access later. That means without storing the instances somewhere ourselves we can't use batch indexing, only index one object at a time.
mep/people/admin.py
Outdated
super().after_import(dataset, result, using_transactions, dry_run, **kwargs) | ||
|
||
# report how many need indexing | ||
print(f'indexing {len(self.objects_to_index)} objects, dry_run = {dry_run}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure you clean this up before merging (converting to debug logging would be fine if you think it may be helpful in future)
mep/people/admin.py
Outdated
Specifies the resource class to use for exporting, | ||
so that separate fields can be exported than those imported | ||
""" | ||
return ExportPersonResource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! glad you figured out a solution
mep/people/tests/test_admin.py
Outdated
self.url_person_import = '/admin/people/person/import/' | ||
self.url_person_process_import = '/admin/people/person/process_import/' | ||
self.url_person_export = '/admin/people/person/export/' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use named urls for these (weirdly not documented in django-i-e docs that I could find)
I think this should work:
self.url_person_import = '/admin/people/person/import/' | |
self.url_person_process_import = '/admin/people/person/process_import/' | |
self.url_person_export = '/admin/people/person/export/' | |
self.url_person_import = reverse("admin:people_person_import") | |
self.url_person_process_import = reverse("admin:people_person_process_import") | |
self.url_person_export = reverse("admin:people_person_export") | |
mep/people/tests/test_admin.py
Outdated
password = 'adminpass' | ||
self.admin_user = User.objects.create_superuser('admin', 'admin@admin.com', password) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had code like this flagged for including passwords in source code (even though it's test code). Was looking elsewhere to see what we've done in this code base - in newer projects we're using pytest-django with an admin client fixture but I don't think that's easy to use here.
One workaround (used elsewhere in mep-django) has been to generate a password so it isn't a hard-coded string. (You'll have to import uuid for this to work)
password = 'adminpass' | |
self.admin_user = User.objects.create_superuser('admin', 'admin@admin.com', password) | |
password = str(uuid.uuid4()) | |
self.admin_user = User.objects.create_superuser('admin', 'admin@admin.com', password) |
self.assertEquals(getstr(person,attr), row[attr]) | ||
|
||
|
||
def _djangoimportexport_do_import_post( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add comments to make clear what this helper function is doing
mep/people/tests/test_admin.py
Outdated
|
||
# New tests: | ||
# `before_import_row | ||
# import formatting methods for gender and nation (you'll need to adjust the nesting) | ||
# get_import_fields | ||
# Person.save optional behavior (skip viaf lookup) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to clean up before merging.
mep/settings.py
Outdated
SKIP_VIAF_LOOKUP = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed since you're assuming false if it's unset, let's omit
SKIP_VIAF_LOOKUP = False |
Ok, I should have responded to all these comments and added a unit test for the Person save method. I'm not sure why these 4 tests elsewhere are now breaking: These two:
Are failing at:
These two are missing sort_name, but not sure why these are affected:
This is where they fail:
Any hints on these? |
mep/people/admin.py
Outdated
if not dry_run: | ||
# re-enable indexing | ||
IndexableSignalHandler.connect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be this causing 1 or 2 of the failing tests? The signal handler ought to revert to its original condition (connected) but maybe it's not somehow and we need to ensure that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooh good catch, I bet you're right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe i should revert index signal status in a tearDown
method on the testing class. and/or revert to previous value instead of assuming it was previously connected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible confirmation - the tests pass when you run just the ones that are failing but fail when you run the whole test suite (i.e., the problem is the interaction between the tests as you've already figured out)
I don't think we currently have a way for you to know what the original condition was! (yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quadrismegistus quick pr on parasolr to update the disconnect method to return a count of handlers disconnected; could use this to determine whether to reconnect Princeton-CDH/parasolr#84
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it is pretty likely a test-only scenario, so your other solution may be simpler + sufficient
mep/people/admin.py
Outdated
# make sure indexing disconnected afterward | ||
IndexableSignalHandler.disconnect() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This in after_import()
seems to fix all remaining errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because after import is not run by the test code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. We just needed to make sure that the indexer is disconnected after we potentially use it in that same function. I thought it was best to leave it on but it's best to leave it off it seems. The other tests likely handle it on their own?
@quadrismegistus I had some cleanup on the viaf lookup save test, more complicated than I wanted to use suggest changes so I went ahead and pushed to your branch, hope you don't mind. I'm seeing deprecation warnings from django-import-export when I run the tests, are you seeing these? |
Specifically: |
This should be fixed now:
|
@quadrismegistus thanks for addressing the relevant deprecation warnings. (totally understand if you didn't notice, there are several others that I need to deal with!) |
No description provided.