The main optimization points I can see (not including more advanced things like hooking into the gffutils sqlite db and creating custom queries, etc.) are string operations and excessive file I/O.
Calling clean_chr_id for every row adds unnecessary (and redundant) string operations, which are relatively expensive (especially in Python). Could perform the string operation once per distinct ID and cache the result, or, if these are the only two options, just build the map directly and use it (rough sketch below).
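A minimal sketch of both caching options, assuming clean_chr_id is the existing helper in your script (its body here is only a stand-in, and the map entries are placeholders):

```python
from functools import lru_cache

def clean_chr_id(raw_id: str) -> str:
    # stand-in for the script's existing clean_chr_id logic
    return raw_id if raw_id.startswith("chr") else f"chr{raw_id}"

# Option 1: memoize the helper so the string work runs at most once per distinct ID.
@lru_cache(maxsize=None)
def clean_chr_id_cached(raw_id: str) -> str:
    return clean_chr_id(raw_id)

# Option 2: if the IDs only ever take a couple of known forms,
# build the map up front and do a plain dict lookup in the row loop.
CHR_ID_MAP = {"1": "chr1", "chr1": "chr1"}  # placeholder entries
```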
Script writes to the output file row by row; for large files this incurs frequent I/O operations (which, relative to CPU operations, are very expensive). Would batch the rows and dump them once a max buffer size is reached (1,000 is just an example; might want to play around with larger values):
```python
# list to hold the batched rows
buffer = []
max_buffer = 1000

for row in reader:
    buffer.append(processed_row)  # processed_row from the existing per-row logic
    # dump to the output file when the buffer limit is reached
    if len(buffer) >= max_buffer:
        writer.writerows(buffer)
        buffer = []

# write any remaining rows
if buffer:
    writer.writerows(buffer)
```
Similarly, can explore batch reading the input file (rough sketch below).
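One way to do the chunked reading is itertools.islice; reader and writer are assumed to be the same csv objects as above, and process_row is a hypothetical stand-in for the existing per-row logic:

```python
from itertools import islice

chunk_size = 1000  # tune alongside max_buffer above

while True:
    # pull up to chunk_size rows from the reader at a time
    chunk = list(islice(reader, chunk_size))
    if not chunk:
        break
    writer.writerows(process_row(row) for row in chunk)
```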
Line 64 is a smaller fix, but repeated thousands of times it might add up: use string slicing instead of replace for better time complexity:
```python
chrom = chrom[:3]
```
From the log file snippet you sent me, set the log level for per-row operations to logging.debug. That level of verbosity shouldn't be captured for that many operations on a normal run (avoid redundant file I/O as much as possible). That way, if you need verbose log statements you can run the script with level=logging.DEBUG, but on normal execution you won't incur the file I/O overhead. If you want some progress logging, you can implement a simple log checkpoint. Depending on how many records you expect to process, would do something like:
```python
log_checkpoint = 10_000

if i % log_checkpoint == 0:
    logging.info(f"Hit log checkpoint at row {i + 1} of {total_rows}...")
```
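For completeness, a minimal sketch of wiring up the level switch; the --debug flag and log file name are placeholders and should be adapted to however the script already handles configuration:

```python
import argparse
import logging

# hypothetical flag; adapt to however the script already parses arguments
parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true")
args = parser.parse_args()

logging.basicConfig(
    filename="script.log",  # placeholder log file name
    level=logging.DEBUG if args.debug else logging.INFO,
)

# inside the row loop, per-row messages stay at debug level:
#     logging.debug("processed row %d: %s", i, chrom)
# they are filtered out (no file write) unless the script is run with --debug
```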