Question about computational requirements #9
I haven't tried to calculate the Disruption Index for all papers in MAG yet, but for the subset of physics papers it took a little less than a day. With `show_progress=True`, the progress bar should give you an estimate of the total wall time after a few iterations. Please let us know if you're successful!
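As a rough illustration of where that wall-time estimate comes from, here is a minimal from-scratch sketch of a bulk Disruption Index loop with a tqdm progress bar. This is not pyscisci's internal implementation, and the MAG-style column names (`CitingPublicationId`, `CitedPublicationId`) are assumptions:

```python
# A minimal from-scratch sketch of a bulk Disruption Index loop -- not
# pyscisci's internal implementation. Column names are MAG-style assumptions.
import pandas as pd
from tqdm import tqdm

def disruption_index(pub2ref_df, focal_ids):
    # Pre-group the edge list once so per-paper lookups are dict hits,
    # not repeated full scans of the citation table.
    cites_of = pub2ref_df.groupby('CitedPublicationId')['CitingPublicationId'].apply(set).to_dict()
    refs_of = pub2ref_df.groupby('CitingPublicationId')['CitedPublicationId'].apply(set).to_dict()

    results = {}
    # tqdm reports iterations/sec, so after a few iterations it can
    # extrapolate the total wall time for the whole run.
    for focal in tqdm(focal_ids):
        citers_of_focal = cites_of.get(focal, set())
        citers_of_refs = set().union(*(cites_of.get(r, set()) for r in refs_of.get(focal, set())))
        citers_of_refs.discard(focal)  # the focal paper itself is not a later citer
        ni = len(citers_of_focal - citers_of_refs)  # cite the focal paper only
        nj = len(citers_of_focal & citers_of_refs)  # cite both focal and its references
        nk = len(citers_of_refs - citers_of_focal)  # cite the references but not the focal paper
        denom = ni + nj + nk
        results[focal] = (ni - nj) / denom if denom else float('nan')
    return pd.Series(results, name='disruption')
```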
BTW, I think I have implemented the Disruption Index calculation differently, and I am not sure which is the correct way, especially in the step that finds the citations to the focal paper's references.

In addition, is it possible to restrict the citation window to within N years after the focal paper? I have tried to implement this on a smaller dataset, and it takes about 8 hours to calculate the DI for 1 million papers. It seems to involve multiple joins between tables in MAG, so I am worried about the performance.
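For reference, the definition both implementations should agree on is the standard one from Funk and Owen-Smith (2017), popularized by Wu et al. (2019):

$$ D = \frac{n_i - n_j}{n_i + n_j + n_k} $$

where $n_i$ counts later papers that cite the focal paper but none of its references, $n_j$ counts those that cite both the focal paper and at least one of its references, and $n_k$ counts those that cite at least one of the references but not the focal paper.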
Hi @tomleung1996, if you have ideas for a more efficient implementation of the Disruption Index, then please share! I tried out a few ideas myself before settling on this one as the best way to do it in bulk.

For the filtering step, I usually view this as a pre-processing decision, not something inherent to the Index calculation itself. So I would pre-filter the pub2ref_df (i.e., remove citations with earlier publication dates, or that occur after the time window) before passing it to the function. But it's a great point, and worth making that more explicit!
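A minimal sketch of that pre-filtering step, assuming the publication years have already been merged into pub2ref_df and that the year columns are named `CitingYear` and `CitedYear` (both assumptions, not pyscisci's exact schema):

```python
# Keep only citations made no earlier than the cited paper and within
# N years of its publication. Column names are illustrative assumptions.
N = 5  # citation window in years (example value)

filtered_pub2ref_df = pub2ref_df[
    (pub2ref_df['CitingYear'] >= pub2ref_df['CitedYear']) &
    (pub2ref_df['CitingYear'] <= pub2ref_df['CitedYear'] + N)
]
```

The filtered frame can then be passed to the Disruption Index calculation unchanged, which keeps the windowing decision out of the metric code itself.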
I am not sure if this is normal. I ran the example code to show the number of publications each year, but it consumes over 150GB of memory (even more than PySpark standalone mode).

I have my own slow implementation of DI with PySpark and SparkSQL, but it got stuck at the final stage (maybe the table joins were too slow). However, SparkSQL is faster at counting the annual publication numbers.
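One way to keep the memory footprint of that kind of count down, sketched under the assumption that the publications table lives in a parquet file with a `Year` column (the path and column name here are hypothetical):

```python
import pandas as pd

# Load only the year column so memory scales with one column rather than
# the full publications table. The file path and column name ('Year') are
# illustrative assumptions, not pyscisci's exact layout.
pubs = pd.read_parquet('publications.parquet', columns=['Year'])
pubs_per_year = pubs.groupby('Year').size().sort_index()
print(pubs_per_year)
```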
I am wondering if I can calculate the Disruption Index for all papers in MAG using pyscisci. The server I used for this task has a 2x32-core CPU and 350GB of RAM. If so, what is the expected time to finish?