Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable SCons to Handle Big (>200MB) Files #1646

Closed
bdbaddog opened this issue Jan 2, 2018 · 0 comments
Closed

Enable SCons to Handle Big (>200MB) Files #1646

bdbaddog opened this issue Jan 2, 2018 · 0 comments
Labels
Milestone

Comments

@bdbaddog
Copy link
Contributor

bdbaddog commented Jan 2, 2018

This issue was originally created at: 2007-04-21 13:25:48.
This issue was reported by: synx.
synx said at 2007-04-21 13:25:48

SCons collects the hash of a file by reading that file's contents entirely to
memory, and then taking the hash of that memory. That is poor programming style
though, and it ceases to work when a file on disk is to large to read entirely
into memory. Instead, you should read a file piece by piece, and feed those
pieces into a hash object, taking the digest at the end. It will result in the
same hash, but only 4096 bytes of memory are needed, instead of 690,169,856
bytes for instance.

I'll try to attach a patch from the latest SVN. It works on my project over
here, but I don't have the mojo to figure out aegis and tests and stuff. Seems
rather gestapo to me. But a sufficient test would be to make a large file, then
try to depend on it. If my patch works, then scons will operate normally and use
that file as a dependancy. (It might take a while to hash 900MB or whatnot.)
Against the baseline scons, it will drag your machine to a screaming halt, spill
into your virtual memory, and stop responding to signals.

synx said at 2007-04-21 13:26:33

Created an attachment (id=182)
The patch for big files

synx said at 2007-04-21 13:27:20

Created an attachment (id=183)
A test to make sure this patch works... I guess...?

stevenknight said at 2007-04-21 20:35:45

Thanks for the patch (and test case). Generators and list comprehensions
weren't introduced until Python 2.2, so we really can't even look at integrating
the patch as-is into the main code base until we drop support for 1.5.2 and 2.1
(probably within the next 3-6 months or so). But we can revisit this at that point.

The logic will likely have to move to some other function than get_contents(),
or be refactored in some other way, since the same method is used to fetch
contents for scanning. Smaller files get a big performance win from reading the
file into memory once and using it for both the scan and the MD5 signature,
whereas a generator is going to read the contents twice if the file is something
that's also going to be scanned for implicit dependencies.

So doing this correctly to handle big files (which we obviously want to do) is
eventually going to require a more sophisticated approach than just throwing a
generator underneath the get_contents() method. It might be something simple
like setting a file-size threshold for when the generator is used. Hard-coding
a threshold, though, would almost certainly require making it configurable; I
haven't yet met an arbitrary hard-coded value that wasn't incorrect for
someone out there...

gregnoel said at 2008-06-09 22:24:43

Bug party triage. To be done as part of Ludwig's GSoC project; finding a good
balance between memory usage and speed.

pankrat said at 2008-06-14 04:15:04

To avoid generators, the MD5 computing function must be closer coupled to the
file reading function. The proposed patch introduces 'MD5filesignature' which
reads a file chunk by chunk to compute the signature. Node.File.FS 'get_csig'
does not call 'get_content' anymore, but just opens the associated file and
feeds it to MD5filesignature.

Tested with Python 2.5, 2.4, 2.3, 2.2, and 1.5.2.

It may be problematic if a subclass of Node.FS.File relies on 'get_csig' to call
'get_contents', though.

pankrat said at 2008-06-14 04:18:21

Created an attachment (id=431)
Rewritten original patch avoiding generators

pankrat said at 2008-07-24 08:01:27

Created an attachment (id=453)
Enhanced patch with user configurable chunk-size and test case

gregnoel said at 2008-07-26 10:31:31

Review and retriage patch.

gregnoel said at 2008-07-29 05:58:13

Bug party triage. Nag Steven to review the code, then put it in as soon as
we're out of 1.0.x.

pankrat said at 2008-07-29 08:48:54

*** Issue 1459 has been marked as a duplicate of this issue. ***

gregnoel said at 2008-09-09 15:03:21

Bug party triage.

pankrat said at 2008-09-28 06:06:00

Patch applied in revision 3514.

synx attached scons.bigfiles.patch at 2007-04-21 13:26:33.

The patch for big files

synx attached scons.bigfiles.test.sh at 2007-04-21 13:27:20.

A test to make sure this patch works... I guess...?

pankrat attached chunk_hash.patch at 2008-06-14 04:18:20.

Rewritten original patch avoiding generators

pankrat attached md5_chunk.patch at 2008-07-24 08:01:26.

Enhanced patch with user configurable chunk-size and test case

pankrat said this issue is duplicated by #1459 at 2008-07-29 08:48:55.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant