Parse gff notes #26

Open
seankim658 opened this issue Dec 11, 2024 · 0 comments
seankim658 commented Dec 11, 2024

The main optimization points I can see (not including more advanced things like hooking into the gffutils SQLite DB and writing custom queries) are string operations and excessive file I/O operations.

Calling clean_chr_id for every row adds unnecessary (and redundant) string operations, which are expensive (especially in Python). Could perform the string operation once per unique chromosome ID and cache the result:

# Cache of raw chromosome ID -> cleaned ID so each unique value is only processed once
chr_id_cache = {}

def clean_chr_id(chr_id):
    if chr_id not in chr_id_cache:
        # Same cleaning as the existing logic: drop the leading "chr" and map X/Y to 23/24
        chr_id_clean = chr_id.lstrip("chr")
        chr_id_clean = "23" if chr_id_clean == "X" else "24" if chr_id_clean == "Y" else chr_id_clean
        chr_id_cache[chr_id] = chr_id_clean
    return chr_id_cache[chr_id]

Or, if X and Y are the only special cases, could just build the mapping directly and look the IDs up (rough sketch below).
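
A sketch of the direct-map version (assuming the IDs follow the chr1–chr22 / chrX / chrY convention; the names here are just illustrative):

# Build the full mapping up front instead of cleaning each ID on the fly
CHR_ID_MAP = {f"chr{i}": str(i) for i in range(1, 23)}
CHR_ID_MAP.update({"chrX": "23", "chrY": "24"})

def clean_chr_id(chr_id):
    # Fall back to returning the raw ID if it isn't one of the expected chromosomes
    return CHR_ID_MAP.get(chr_id, chr_id)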

The script writes to the output file row by row; for large files this incurs frequent I/O operations (which, relative to CPU operations, are very expensive). Would batch the rows and dump them once a specified max buffer size is reached (1,000 is just an example; might want to play around with larger values):

# List to hold the batched rows
buffer = []
max_buffer = 1000

for row in reader:
    processed_row = ...  # existing per-row processing goes here
    buffer.append(processed_row)
    # Dump to the output file when the buffer limit is reached
    if len(buffer) >= max_buffer:
        writer.writerows(buffer)
        buffer = []

# Write any remaining rows
if buffer:
    writer.writerows(buffer)

Similarly, could explore batch reading the input file (rough sketch below).
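
One option is pulling rows in chunks with itertools.islice (whether this actually helps will depend on how the underlying file object is buffered, so worth profiling):

from itertools import islice

chunk_size = 10_000
while True:
    # Pull up to chunk_size rows from the csv reader at a time
    chunk = list(islice(reader, chunk_size))
    if not chunk:
        break
    for row in chunk:
        ...  # per-row processing as before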

Line 64 is a smaller fix, but repeated thousands of times it might add up: use string slicing instead of replace, since replace has to scan the whole string:

chrom = chrom[:3]

From the log file snippet you sent me, set the log level for the per-row operations to logging.debug. That level of verbosity shouldn't be captured for that many operations on a normal run (avoid redundant file I/O as much as possible). That way, if you need verbose log statements you can run the script with level=logging.DEBUG, but on normal execution you won't incur the file I/O overhead. If you want some progress logging, you can implement a simple log checkpoint. Depending on how many records you expect to process, would do something like:

log_checkpoint = 10_000

# Inside the row loop, where i comes from enumerate(reader) and total_rows is known up front
if i % log_checkpoint == 0:
    logging.info(f"Hit log checkpoint at row {i + 1} of {total_rows}...")
seankim658 added the enhancement label on Dec 11, 2024