
Question about computational requirements #9

Open
tomleung1996 opened this issue Jul 27, 2021 · 4 comments

@tomleung1996

I am wondering whether I can calculate the Disruption Index for all papers in MAG using pyscisci. The server I would use for this task has two 32-core CPUs and 350 GB of RAM. If it is feasible, how long should I expect it to take?

@ajgates42
Collaborator

I haven't tried to calculate the Disruption Index for all papers in MAG yet, but for the subset of physics papers it took a little less than a day. If `show_progress=True`, then after a few iterations the progress bar should give you an estimate of the total wall time. Please let us know if you're successful!
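
For reference, the calculation itself boils down to the standard Disruption Index definition: for a focal paper, count the papers that cite it but none of its references (n_i), the papers that cite both it and at least one of its references (n_j), and the papers that cite its references but not it (n_k), then take (n_i - n_j) / (n_i + n_j + n_k). Below is a minimal, illustrative pandas sketch for a single focal paper (not pyscisci's optimized bulk implementation), assuming a citation table `pub2ref` with columns 'CitingPublicationId' and 'CitedPublicationId'; adjust the names to your schema:

```python
import pandas as pd

def disruption_index(pub2ref: pd.DataFrame, focal_id) -> float:
    """DI = (n_i - n_j) / (n_i + n_j + n_k) for one focal publication."""
    # References of the focal paper.
    refs = set(pub2ref.loc[pub2ref['CitingPublicationId'] == focal_id,
                           'CitedPublicationId'])
    # Papers that cite the focal paper.
    cites_focal = set(pub2ref.loc[pub2ref['CitedPublicationId'] == focal_id,
                                  'CitingPublicationId'])
    # Papers that cite at least one of the focal paper's references
    # (excluding the focal paper itself, which cites its own references).
    cites_refs = set(pub2ref.loc[pub2ref['CitedPublicationId'].isin(refs),
                                 'CitingPublicationId'])
    cites_refs.discard(focal_id)

    n_i = len(cites_focal - cites_refs)   # cite the focal paper only
    n_j = len(cites_focal & cites_refs)   # cite the focal paper and its references
    n_k = len(cites_refs - cites_focal)   # cite the references but not the focal paper

    denom = n_i + n_j + n_k
    return (n_i - n_j) / denom if denom else float('nan')
```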

@tomleung1996
Author

BTW, I think I have implemented the Disruption Index calculation differently. I am not sure which is the correct way.

When finding the citations to the focal paper's references (cite2ref in your implementation, I believe), I filter out citing papers published before the focal paper (see Fig. 1a in the paper "Large teams develop and small teams disrupt science and technology").

In addition, is it possible to restrict the citation window to within N years after the focal paper's publication? I tried to implement this on a smaller dataset, and it takes about 8 hours to calculate the DI for 1 million papers. The calculation seems to involve multiple joins between MAG tables, so I am worried about performance.

@ajgates42
Collaborator

Hi @tomleung1996, if you have ideas for a more efficient implementation of the Disruption Index, then please share! I tried out a few ideas myself before settling on this one as the best way to do it in bulk.

For the filtering step, I usually view this as a pre-processing decision rather than something inherent to the index calculation itself. So I would pre-filter the pub2ref_df (i.e. remove citations with earlier publication dates, or those that fall outside the time window) before passing it to the function. But it's a great point that we should make this more explicit!
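
As a concrete sketch of that pre-filtering step (assuming pub_df has 'PublicationId' and 'Year' columns and pub2ref_df has 'CitingPublicationId'/'CitedPublicationId'; the exact column names depend on your schema), something like the following should work:

```python
import pandas as pd

def prefilter_citations(pub2ref_df: pd.DataFrame, pub_df: pd.DataFrame,
                        window_years=None) -> pd.DataFrame:
    # Map every publication to its publication year.
    year = pub_df.set_index('PublicationId')['Year']

    df = pub2ref_df.copy()
    df['CitingYear'] = df['CitingPublicationId'].map(year)
    df['CitedYear'] = df['CitedPublicationId'].map(year)

    # Drop citations whose citing paper predates the paper it cites.
    keep = df['CitingYear'] >= df['CitedYear']

    # Optionally keep only citations made within N years of the cited paper.
    if window_years is not None:
        keep &= (df['CitingYear'] - df['CitedYear']) <= window_years

    return df.loc[keep, ['CitingPublicationId', 'CitedPublicationId']]
```

The filtered frame can then be passed to the disruption function in place of the original pub2ref_df.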

@tomleung1996
Author

I am not sure whether this is normal. I ran the example code that counts the number of publications each year, and it consumed over 150 GB of memory (even more than pySpark in standalone mode), despite setting keep_in_memory=False. It takes about 13 minutes to finish, and it doesn't seem to take advantage of a multi-core CPU (the CPU usage stays below 100% in top).

I have my own (slow) implementation of the DI with pySpark and SparkSQL, but it got stuck at the final stage (maybe the table joins were too slow). However, SparkSQL is faster at counting the annual publication numbers.
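
For reference, the annual count itself reduces to a groupby over the publication year; a minimal pandas sketch, assuming the publication table is loaded as a DataFrame `pub_df` with a 'Year' column (a hypothetical name), is:

```python
# Count publications per year from `pub_df` (assumed to have a 'Year' column);
# loading only that column from disk should keep memory pressure low.
pubs_per_year = pub_df['Year'].value_counts().sort_index()
print(pubs_per_year)
```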
