Some languages are better suited for certain tasks. But which ones and for which tasks?
A common bioinformatics task is to cross-reference the names in two or more files. At the outset, you must know whether the relationships are one-to-one, one-to-many, or many-to-many:

- The names are unique in both lists (one-to-one)
- The names occur once in one list, but multiple times in the other (one-to-many)
- The names occur multiple times in both lists (many-to-many)
If it takes longer than 1 second, you're probably doing it wrong
Let's make some example lists with the `makelists.py` program. This creates a master list of names, `names.txt`, a subset containing unique names and values, `unique.txt`, and a subset containing names associated with multiple values, `multi.txt`. The following command makes 1000 names for the `names.txt` file. Approximately 10% of the names end up in `unique.txt` and `multi.txt`, but `multi.txt` is larger because it has multiple values for each name.
```
python3 makelists.py 1000
ls -lrth
wc *.txt
less *.txt
```
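For reference, a generator along these lines would produce files like those described above. This is a purely hypothetical sketch, not the actual contents of `makelists.py`; the real output format may differ:

```
# Hypothetical makelists.py-style generator (NOT the real script).
# Writes "name index" pairs to names.txt; ~10% of names also get one
# value in unique.txt, and another ~10% get several values in multi.txt.
import random
import sys

n = int(sys.argv[1])  # e.g. 1000

with open('names.txt', 'w') as names, \
        open('unique.txt', 'w') as uniq, \
        open('multi.txt', 'w') as multi:
    for i in range(n):
        name = f'name{i:07d}'
        print(name, i, file=names)
        r = random.random()
        if r < 0.1:                           # one-to-one subset
            print(name, random.randint(0, 99), file=uniq)
        elif r < 0.2:                         # one-to-many subset
            for _ in range(random.randint(2, 5)):
                print(name, random.randint(0, 99), file=multi)
```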
The `grep_merge1.sh` file has the following contents.
```
1 while read name index; do
2     if found=`grep "$name" unique.txt`
3     then
4         echo $found $index
5     fi
6 done < names.txt
```
Lines 1-6 perform a `while` loop that iterates over the `names.txt` file as STDIN. Line 1 splits each line into the variables `name` and `index`. Line 2 performs a `grep`, looking for the name in `unique.txt`. If successful, the `found` variable contains the matching line. Line 4 prints the grep-matched line and the index from line 1.
The `grep` solution is simple and fast to type. For small files it works okay, but the overall approach is both slow and wasteful for two reasons (the sketch below illustrates the first):

- `grep` is called many times
- `grep` is a linear search
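To get a feel for the first problem, here is a rough, hypothetical timing sketch (the numbers depend entirely on your machine). It launches `grep` over and over, the way the shell loop does, so most of what it measures is process startup:

```
# Rough measure of process-launch overhead: run grep many times against
# a small file and time how long the launches take in total.
import subprocess
import time

t0 = time.time()
for _ in range(100):
    subprocess.run(['grep', 'name', 'unique.txt'], capture_output=True)
print(f'{time.time() - t0:.3f} seconds for 100 grep processes')
```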
Let's examine what happens if the list is twice as large.
```
python3 makelists.py 2000
time sh grep_merge1.sh
```
Not too surprising: it takes twice as long. That's what it means for a search to be linear. The way the script is constructed, the outer `while` loop reads the larger file and the inner `grep` searches the smaller file. What happens if we reverse the two? `grep_merge2.sh` swaps the roles of the two files.
```
time sh grep_merge2.sh
```
That's clearly a lot faster. Why? It's the same number of comparisons either way: every line of one file must be compared to every line of the other file. However, `grep_merge2.sh` runs `grep` fewer times, and every time you run a program there is overhead from the operating system. Let's try a case with 10,000 names.
```
python3 makelists.py 10000
time sh grep_merge2.sh
```
It takes my computer around 5.9 seconds to cross-reference 10,000 names with another 1,000 names, or about 1.7 million comparisons per second. That may sound fast, but if each list were 1 million items long, that would be 1e12 comparisons: about 1 week at this speed.
If your data is unique, it can be sorted and searched efficiently with a binary search. First you must `sort` the data, then search it with `look`, as `look_merge.sh` does below.
```
sort names.txt > names_sorted.txt
while read name index; do
    if found=`look "$name" names_sorted.txt`
    then
        echo $found $index
    fi
done < unique.txt
```
Despite the overhead of having to sort the file initially, the `look` strategy is faster.
```
time sh look_merge.sh
```
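`look` is fast because a binary search of n sorted lines takes only about log2(n) comparisons per query. You can get the same effect inside Python with the standard `bisect` module, without launching a new process per query. Here is a minimal sketch, assuming the same two-column "name index" file format as above:

```
# Binary-search cross-reference using Python's bisect module: sort the
# larger list once, then each lookup costs O(log n) comparisons.
import bisect

with open('names.txt') as fp:
    pairs = sorted(line.split() for line in fp)   # [[name, index], ...]
keys = [pair[0] for pair in pairs]                # sorted keys for bisect

with open('unique.txt') as fp:
    for line in fp:
        name = line.split()[0]
        i = bisect.bisect_left(keys, name)
        if i < len(keys) and keys[i] == name:
            print(line.rstrip(), pairs[i][1])
```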
Even though `look` is better than `grep`, if you find yourself wanting to go faster than `grep`, don't reach for `look`: there are better solutions than running hundreds or thousands of shell commands.
Here are several Python solutions. They are all faster than `grep`, but they are still linear searches (a sketch of the "memorize a file" approach follows the table). MCS is millions of comparisons per second.
Program | MCS | Notes |
---|---|---|
grep_merge1.sh | 0.30 | outer file large |
grep_merge2.sh | 1.72 | outer file small |
python_linear1.py | 2.07 | outer file large |
python_linear2.py | 2.52 | outer file small |
python_linear3.py | 55.13 | memorize a file |
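For illustration, here is a minimal sketch in the spirit of the "memorize a file" approach (hypothetical; the real `python_linear3.py` may differ). Reading the smaller file into memory once removes all of the process overhead, leaving only the linear scan itself:

```
# Linear-search cross-reference with the smaller file held in memory.
# Still O(n * m) comparisons overall, but no per-query process launches.
targets = []
with open('unique.txt') as fp:
    for line in fp:
        targets.append(line.split())          # [name, value, ...]

with open('names.txt') as fp:
    for line in fp:
        name, index = line.split()
        for target in targets:                # linear scan
            if target[0] == name:
                print(' '.join(target), index)
                break
```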
Dictionaries perform at least as well as binary searches: Python implements them as hash tables, so lookups take roughly constant time. As the size of the data grows, dictionaries increasingly out-perform lists (see the sketch after the table). In the table below, the columns are the sizes of the `names.txt` file, and times are in seconds unless otherwise noted.
Program | 1e3 | 1e4 | 1e5 | 1e7 |
---|---|---|---|---|
grep_merge1.sh | 3.129 | 34.02 | 57 min | 4 days |
grep_merge2.sh | 0.355 | 5.881 | 10 min | 16 hrs |
python_linear1.py | 0.119 | 4.603 | 8 min | 13 hrs |
python_linear2.py | 0.088 | 4.234 | 7 min | 12 hrs |
python_linear3.py | 0.045 | 0.223 | 18.33s | 31 min |
python_dict.py | 0.044 | 0.047 | 0.102s | 0.674s |
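Here is a minimal sketch in the spirit of `python_dict.py` (again hypothetical; the real script may differ). Building the dictionary costs one pass over the smaller file; after that, every membership test is a single hash lookup:

```
# Dictionary (hash) cross-reference: roughly O(n + m) work overall
# instead of the O(n * m) comparisons of a linear search.
lookup = {}
with open('unique.txt') as fp:
    for line in fp:
        name = line.split()[0]
        lookup[name] = line.rstrip()          # map name -> whole line

with open('names.txt') as fp:
    for line in fp:
        name, index = line.split()
        if name in lookup:                    # average O(1) lookup
            print(lookup[name], index)
```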
For another day...

- one-to-many
- many-to-many
- dataframe comparisons
- SQL comparisons
- `split()`
- `csv.reader()`
Some files are meant to be human-readable, and not much thought has gone into making them easy to read by machines. Perl regexes and hashes (its dictionaries) make quick work of such files, and backticks in Perl are a quick-n-dirty way to capture shell output.