Virtual memory std:bad_alloc in 32bit mode #10

Open
DTL2020 opened this issue Jan 11, 2025 · 28 comments

Comments

@DTL2020
Collaborator

DTL2020 commented Jan 11, 2025

If I compile a 32-bit binary and use
JincResize(2000,2000):
For a 2000x2000 output frame size: const int coeff_per_pixel = 112 (bytes)
and there is a std::bad_alloc crash (C++ runtime exception) at

tmp_array.reserve(static_cast<int64_t>(dst_width) * dst_height * coeff_per_pixel);

where 2000x2000x112 is about 448 MB that the 32-bit process tries to allocate and reserve, and the attempt fails. It looks like there is not enough contiguous virtual address space left in the total address space of that 32-bit process (and the Windows memory manager does not defragment it, if that is even possible).

This issue looks very rare in 64-bit builds, but it is also possible there (in theory). Is there any way to fix it without a significant rewrite to use smaller memory blocks/objects?

I have no idea how to work around it easily. Maybe at least catch the C++ bad_alloc exception inside the plugin and throw a formatted error message saying that the current kernel size and output frame size are too large for the memory model in use?
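A minimal sketch of that last idea, under stated assumptions: `reserve_coeffs` and the wording of the message are hypothetical and not taken from the plugin, and the caller is assumed to forward the exception text to the AviSynth error mechanism. It also assumes the element count fits in 64 bits, which holds for any realistic frame size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reserve the coefficient table and convert a bare
// std::bad_alloc into an error that names the requested size.
inline void reserve_coeffs(std::vector<float>& table, int dst_width, int dst_height,
                           int coeff_per_pixel)
{
    assert(dst_width > 0 && dst_height > 0 && coeff_per_pixel > 0);
    // Number of float coefficients, computed in 64 bits so it cannot wrap in a 32-bit build.
    const std::uint64_t count = static_cast<std::uint64_t>(dst_width) *
                                static_cast<std::uint64_t>(dst_height) *
                                static_cast<std::uint64_t>(coeff_per_pixel);
    try {
        // A request the container cannot even express is treated like a failed allocation.
        if (count > table.max_size())
            throw std::bad_alloc();
        table.reserve(static_cast<std::size_t>(count));
    } catch (const std::bad_alloc&) {
        // Divide instead of multiplying by sizeof(float) to avoid overflow.
        const std::uint64_t mib = count / (1024u * 1024u / sizeof(float));
        throw std::runtime_error(
            "JincResize: not enough memory for the coefficient table (output " +
            std::to_string(dst_width) + "x" + std::to_string(dst_height) + ", " +
            std::to_string(coeff_per_pixel) + " coefficients per pixel, about " +
            std::to_string(mib) + " MiB requested)");
    }
}
```

The message carries the output size and the per-pixel coefficient count, so the user can see which parameters to reduce instead of getting an unhandled exception.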

@DTL2020
Collaborator Author

DTL2020 commented Jan 12, 2025

The old versions of JincResize also suffer from the same memory issues, at least partially (with the same Evaluate: Unhandled C++ Exception): AviSynth/jinc-resize#2

Current idea for a (partial) solution:

Without too much redesign of the processing, split the single full-frame array of coefficients into an array of arrays, one per output line, with a separate memory allocation for each output line. If everything goes as expected, it stays compatible with the current SIMD loads from memory. The only change to the SIMD processing functions is to update the pointer to the coefficient array before each new output line starts.

In theory this relaxes the requirement for a single contiguous buffer on 32-bit systems and allows all available free memory in a 32-bit environment to be used.
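A rough sketch of the per-line layout described above. The names `RowCoeffs`, `allocate_rows` and `row_ptr` are hypothetical, and plain `new` is used for brevity; real code would presumably need an aligned allocator for each row to keep aligned SIMD loads working.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// One independently allocated block of coefficients per output line, so no
// single allocation has to span the whole frame.
struct RowCoeffs {
    std::vector<std::unique_ptr<float[]>> rows;
    std::size_t floats_per_row = 0;
};

// Returns false (and releases everything) if any row cannot be allocated.
inline bool allocate_rows(RowCoeffs& out, int dst_width, int dst_height, int coeff_per_pixel)
{
    assert(dst_width > 0 && dst_height > 0 && coeff_per_pixel > 0);
    out.floats_per_row = static_cast<std::size_t>(dst_width) * static_cast<std::size_t>(coeff_per_pixel);
    out.rows.clear();
    out.rows.reserve(static_cast<std::size_t>(dst_height)); // only dst_height pointers
    for (int y = 0; y < dst_height; ++y) {
        // nothrow new returns nullptr on failure; () zero-initializes the row.
        std::unique_ptr<float[]> row(new (std::nothrow) float[out.floats_per_row]());
        if (!row) {
            out.rows.clear();
            return false;
        }
        out.rows.push_back(std::move(row));
    }
    return true;
}

// The processing loop would fetch this pointer once before each output line.
inline const float* row_ptr(const RowCoeffs& c, int y)
{
    return c.rows[static_cast<std::size_t>(y)].get();
}
```

For a 2000-pixel-wide line with 112 coefficients per pixel, each row is about 875 KiB, which is far easier to place in a fragmented address space than one block for the whole frame.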

@Asd-g
Owner

Asd-g commented Jan 12, 2025

With JincResize(2000,2000) I don't have issues. But if the target dimensions are increased further, the limit will be hit.
2000*2000*112*4 = ~1.67GB.
I started to clean up the code a bit, but generally I don't think it's worth doing anything just to keep the 32-bit version working for a specific case. Generally, a 32-bit environment isn't suitable for working with high-resolution video.
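The arithmetic above can be checked with a few lines. This is only a worked calculation: the function names are made up for illustration, and the 2 GiB figure assumes the default user address space of a 32-bit Windows process that is not large-address-aware.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a full-frame table of float32 coefficients.
constexpr std::uint64_t coeff_table_bytes(std::uint64_t dst_width, std::uint64_t dst_height,
                                          std::uint64_t coeff_per_pixel)
{
    return dst_width * dst_height * coeff_per_pixel * sizeof(float);
}

constexpr double bytes_to_gib(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

// Assumption: 2 GiB of user address space (32-bit process, not large-address-aware).
constexpr std::uint64_t kUserSpace32 = 2ULL * 1024 * 1024 * 1024;

inline bool fits_in_32bit_user_space(std::uint64_t bytes)
{
    return bytes < kUserSpace32;
}

// 2000 * 2000 * 112 coefficients * 4 bytes each.
static_assert(coeff_table_bytes(2000, 2000, 112) == 1792000000ULL, "2000*2000*112*4 bytes");
```

`bytes_to_gib(1792000000)` is about 1.67, so the "~1.67GB" above is in binary units (GiB). The table is numerically below 2 GiB, but it has to be one contiguous block that coexists with everything the host application and other plugins have already mapped, which is why the allocation can still fail.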

@filler56789

@Asd-g — I agree that it wouldn't be worth the hassle to make the plugin "more 32-bit friendly", so to speak.
But I strongly recommend that from now on, this issue /limitation is mentioned and described in the README file.
As I said in the topic that I opened @ doom9-dot-org, the 32-bit limitations of the plugin ARE OLD, but they had never been mentioned before, neither in the wikis, nor in the READMEs. Proper documentation matters.

@DTL2020
Collaborator Author

DTL2020 commented Jan 13, 2025

"2000*2000*112*4 = ~1.67GB."

Oh, I forgot about the 4-byte size of float32. Also, the size of a table entry depends greatly on the kernel size. JincResize() is equal to the smallest kernel, Jinc36Resize, and if the user wants a larger kernel such as Jinc256Resize, the limit is hit much faster (as the square of the kernel size?).

One possible logical optimization that could roughly double the usable memory when preparing the coefficient table: currently the table is computed in tmp_array, defined as

std::vector<float> tmp_array;

At the end of the function this unaligned, growing array is simply copied to the aligned working array allocated at

out->factor = static_cast<float*>(aligned_malloc(tmp_array_size * sizeof(float), 64)); // aligned to cache line

So at the time of the copy the OS must hold two large arrays, and this may fail sooner in a 32-bit address space. Also, a possible failure of the C allocation function is not checked, and the pointer is used as if it were always valid.

As a logical optimization, the coefficients could be computed directly into the single aligned memory area that is used globally, allocated before the computation starts (its size can be determined before the computing loops start). This is expected to roughly double the largest usable combination of output frame size and 2D kernel size. Not a very large gain, but it may sometimes help.

It would also be good to throw formatted error messages to the AVS user about the memory limits for the requested output size and kernel size (as they apply to this particular program start). A plain crash or Unhandled C++ Exception looks like a bug in the software and does not describe the real issue. This way the software documents itself at runtime. The memory limits in 32-bit mode are very dynamic and depend on all previous allocations in the process address space: the AVS plugin is initialized after the calling application, the AVS core, and possibly many other plugins. The remaining memory depends greatly on all those earlier allocations.
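A sketch of the single-buffer idea with the missing null check, under assumptions: `aligned_alloc_or_null` is a portable stand-in written for this example (it is not the plugin's `aligned_malloc`), `build_table` is hypothetical, and `weight` is a placeholder for the real coefficient computation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stand-in aligned allocator: over-allocates with malloc, aligns the pointer
// by hand and stores the original pointer just before the aligned block.
// Returns nullptr on failure.
inline void* aligned_alloc_or_null(std::size_t bytes, std::size_t alignment)
{
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0); // power of two
    if (bytes > SIZE_MAX - alignment - sizeof(void*))
        return nullptr; // the padded size would overflow
    void* raw = std::malloc(bytes + alignment + sizeof(void*));
    if (raw == nullptr)
        return nullptr;
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t aligned =
        (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
    reinterpret_cast<void**>(aligned)[-1] = raw; // remembered for the matching free
    return reinterpret_cast<void*>(aligned);
}

inline void aligned_free_ptr(void* p)
{
    if (p != nullptr)
        std::free(static_cast<void**>(p)[-1]);
}

// Computes the coefficients straight into the final 64-byte-aligned buffer,
// so no temporary vector of the same size exists at any point.
inline float* build_table(std::size_t count, float (*weight)(std::size_t))
{
    if (count > SIZE_MAX / sizeof(float))
        return nullptr; // byte size not representable in this address space
    float* table = static_cast<float*>(aligned_alloc_or_null(count * sizeof(float), 64));
    if (table == nullptr)
        return nullptr; // the caller turns this into a readable error
    for (std::size_t i = 0; i < count; ++i)
        table[i] = weight(i);
    return table;
}
```

This only works when the final element count is known before the loops start; the caller checks the returned pointer and reports an error instead of dereferencing it.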

@filler56789

filler56789 commented Jan 13, 2025

P.S.:
«Generally 32-bit environment isn't suitable for working with large resolution videos.»
While that is generally true, you really should not use it as the reason for not caring about possible 32-bit-friendly "improvements". One should clearly state that, unlike the other well-known Avisynth resizers, JincResize is VERY «resource-intensive», which indeed puts limits to its practical use in 32-bit applications. Just as an example, my 32-bit Avisynth+ has no problems at all with "Spline64Resize(4096,2304)", but it doesn't like a simple "Jinc256Resize(640,360)"
{{ ERROR_MESSAGE = JincResize: internal error; filter size '50' is not supported }}.

@DTL2020
Collaborator Author

DTL2020 commented Jan 13, 2025

JincResize is VERY «resource-intensive», which indeed puts limits to its practical use in 32-bit applications.

There is a second branch of JincResize with a very small set of kernel coefficients (a single kernel only, stored as 1/4 of the full 2D kernel because of its symmetry in both dimensions). It works with close to unlimited output size even in 32-bit mode. It is also faster, because it is not limited by very slow host RAM performance: the kernel stays in the CPU caches, and it can reach about 50% of the peak FMA performance of the CPU's SIMD unit.

But it supports only fixed scale ratios, and very few are implemented, because every combination of scale ratio, kernel size, and input/output bit depth requires a hand-written SIMD routine. It is not compiled as the 'main' release, but it could be built for 32-bit users.

A fixed scale ratio is not a great limitation, because most of the benefit of 2D upscaling may come at 2x upscale, and at larger factors the difference from 'classic' 1D+1D image resizers may be less visible. The same applies to natural 2x resizers like NNEDI, and users live with this. To get an arbitrary scale ratio, first resize to the nearest integer ratio with the 'best' resizer, then do a second pass to the required size with a 'standard' resizer that supports any scale ratio (like Lanczos or any other 1D+1D resizer with separate H and V passes).

Maybe the second branch could be compiled and released as a second version/branch for users to experiment with? Users would need to install (use) only one .dll, because the function names are the same, and only for supported combinations of scale ratio, kernel size, and bit depth does it call the alternative kernel generation and resampling engine.
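To illustrate the quarter-kernel storage mentioned above, here is a small sketch. It is not the code of that branch: `QuarterKernel` and `make_quarter_kernel` are hypothetical, and the sketch assumes a kernel sampled on an integer grid that is symmetric under sign flips of both offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Stores only the quadrant with dx >= 0 and dy >= 0; the other three
// quadrants are recovered through |dx| and |dy|.
struct QuarterKernel {
    int radius = 0;
    std::vector<float> quadrant; // (radius + 1) * (radius + 1) weights

    float at(int dx, int dy) const
    {
        const int ax = std::abs(dx);
        const int ay = std::abs(dy);
        assert(ax <= radius && ay <= radius);
        return quadrant[static_cast<std::size_t>(ay) * static_cast<std::size_t>(radius + 1) +
                        static_cast<std::size_t>(ax)];
    }
};

// `weight` is a placeholder for the real kernel function.
inline QuarterKernel make_quarter_kernel(int radius, float (*weight)(int, int))
{
    assert(radius >= 0 && weight != nullptr);
    QuarterKernel k;
    k.radius = radius;
    k.quadrant.reserve(static_cast<std::size_t>(radius + 1) * static_cast<std::size_t>(radius + 1));
    for (int dy = 0; dy <= radius; ++dy)
        for (int dx = 0; dx <= radius; ++dx)
            k.quadrant.push_back(weight(dx, dy));
    return k;
}
```

With radius 3 this stores 16 weights instead of the 49 of the full 7x7 kernel, and the table size does not depend on the output frame size at all, which is what keeps it in the CPU caches.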

@Asd-g
Owner

Asd-g commented Jan 13, 2025

@Asd-g — I agree that it wouldn't be worth the hassle to make the plugin "more 32-bit friendly", so to speak. But I strongly recommend that from now on, this issue /limitation is mentioned and described in the README file. As I said in the topic that I opened @ doom9-dot-org, the 32-bit limitations of the plugin ARE OLD, but they had never been mentioned before, neither in the wikis, nor in the READMEs. Proper documentation matters.

Actually the 32-bit limitation is mainly one of the environment (OS memory limits), not of the plugin. Blaming the plugin for hitting the OS limits and crashing isn't correct, imo.
Any workaround to avoid crashing on memory limits wouldn't be free: it would mainly cost some speed, more or less.
Yes, I will add a NOTE, even though the project doesn't provide any support for 32-bit environments.

One possible logical optimization that could roughly double the usable memory when preparing the coefficient table: currently the table is computed in tmp_array, defined as

std::vector<float> tmp_array;

At the end of the function this unaligned, growing array is simply copied to the aligned working array allocated at

out->factor = static_cast<float*>(aligned_malloc(tmp_array_size * sizeof(float), 64)); // aligned to cache line

So at the time of the copy the OS must hold two large arrays, and this may fail sooner in a 32-bit address space. Also, a possible failure of the C allocation function is not checked, and the pointer is used as if it were always valid.

As a logical optimization, the coefficients could be computed directly into the single aligned memory area that is used globally, allocated before the computation starts (its size can be determined before the computing loops start). This is expected to roughly double the largest usable combination of output frame size and 2D kernel size. Not a very large gain, but it may sometimes help.

It would also be good to throw formatted error messages to the AVS user about the memory limits for the requested output size and kernel size (as they apply to this particular program start). A plain crash or Unhandled C++ Exception looks like a bug in the software and does not describe the real issue. This way the software documents itself at runtime. The memory limits in 32-bit mode are very dynamic and depend on all previous allocations in the process address space: the AVS plugin is initialized after the calling application, the AVS core, and possibly many other plugins. The remaining memory depends greatly on all those earlier allocations.

This has already been tested (avoiding the temporary memory) and it doesn't bring any significant change, because the main memory allocation is the problem (it doesn't matter whether the vector is temporary or global/permanent). Currently the maximum memory that could be needed is allocated, and then only the needed part is initialized.
If you don't allocate the maximum but instead allocate and initialize for every pixel (thus allocating only the memory actually used), the speed will be affected. I have to test it.
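One way to read "allocate only what is actually used" is a compact table with a per-pixel offset index. The sketch below is an interpretation, not the plugin's code; `CompactTable`, `append_pixel` and `pixel_coeffs` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coefficients of all pixels packed back to back; offset[i] .. offset[i + 1]
// delimit the coefficients of pixel i.
struct CompactTable {
    std::vector<std::size_t> offset = std::vector<std::size_t>(1, 0);
    std::vector<float> data;
};

// Appends the n coefficients actually needed for the next pixel.
inline void append_pixel(CompactTable& t, const float* coeffs, std::size_t n)
{
    assert(coeffs != nullptr || n == 0);
    t.data.insert(t.data.end(), coeffs, coeffs + n);
    t.offset.push_back(t.data.size());
}

inline std::size_t pixel_count(const CompactTable& t)
{
    return t.offset.size() - 1;
}

// Returns a pointer to the coefficients of pixel i and stores their count in n.
inline const float* pixel_coeffs(const CompactTable& t, std::size_t i, std::size_t& n)
{
    assert(i < pixel_count(t));
    n = t.offset[i + 1] - t.offset[i];
    return t.data.data() + t.offset[i];
}
```

The trade-off matches the concern above: the growing vector reallocates and copies as it fills (briefly holding the old and new storage at once), and the processing loop needs an extra offset lookup per pixel.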

@Asd-g
Owner

Asd-g commented Jan 13, 2025

Btw, with avslibplacebo (GPU) you have Jinc too (ewa_lanczos / ewa_lanczossharp).
A 32-bit version of libplacebo is also not supported, but here you can get one build and test it.

@Asd-g
Owner

Asd-g commented Jan 16, 2025

@filler56789, try the attached 32-bit version.

JincResize.zip

@filler56789

The new 32-bit DLL is a nicely-done job thus far :-)
No more crashes and no more enigmatic error messages. THUMBS UP.
NOTICE, I haven't dared to test resizing anything above 4K, since I don't have any use-case for 8K anyway ^_^

@DTL2020
Collaborator Author

DTL2020 commented Jan 17, 2025

On Win7 it cannot load: LoadPlugin: unable to load "JincResize.dll", Module not found. Install missing library?

Maybe some redistributables are required?

@Asd-g
Owner

Asd-g commented Jan 17, 2025

Probably. You can use Dependencies to see if something is missing.

@DTL2020
Collaborator Author

DTL2020 commented Jan 17, 2025

Image

On another Win7 install, when I attach the VS2019 debugger to VirtualDub and open the test script, it throws an error:

MSVCP140_ATOMIC_WAIT.dll is missing. Where can I get this?

@filler56789

Hmmm, I use Windows 8.1, so this problem would not happen to me.
@ BOTH: ¿How about compiling with MSYS2+GCC instead of Microsoft Visual C?

@DTL2020
Collaborator Author

DTL2020 commented Jan 17, 2025

Installing the latest version (14.42.34433.0) from https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170

fixes the missing .dll. But the test script
LoadPlugin("JincResize.dll")

ColorBarsHD()

JincResize(2000,2000)

causes AVSmeter to exit without an error at the script pre-scan, and VirtualDub disappears without an error (crash). I will try attaching a debugger.

The debugger caught the crash: Exception thrown at 0x053A37A9 (JincResize.dll) in VirtualDub.exe: 0xC0000005: Access violation reading location 0xFFFFFFFF.

Image

@Asd-g
Owner

Asd-g commented Jan 17, 2025

I pushed the changes. You can try to build it with msvc because I used clang-cl for that test build.

@Asd-g
Owner

Asd-g commented Jan 17, 2025

The previous comment is for DTL.

... @ BOTH: ¿How about compiling with MSYS2+GCC instead of Microsoft Visual C?

You can build it with MSYS2+GCC, but it will not work. The plugin must be changed to use the C API instead of the C++ API, and then you can use any compiler.
Any particular reason to want to build it with GCC? I will add a CMake build; then you can install clang-cl and build the plugin.

@filler56789

filler56789 commented Jan 17, 2025

Any particular reason to want to build it with GCC? I will add a CMake build; then you can install clang-cl and build the plugin.

VARIOUS reasons :-)

  1. GCC + MinGW-w64 + MSYS2 is not overmegabloated;
  2. it's free :-)
  3. it does not screw up the Windows Registry, does not install unnecessary services, is not a resource-pig;

These are the reasons why I don't want to use MSVC.
But no, I don't want to learn how to use the clang thing either, I'm already too old for that ^_^

@DTL2020
Collaborator Author

DTL2020 commented Jan 19, 2025

I downloaded the sources from main and built them with MSVC 2019.

For this test script, on Win10 with 32-bit VirtualDub,

LoadPlugin("JincResize.dll")
ColorBarsHD()
JincResize(2000,2000)
#Jinc144Resize(2000,2000)
#Jinc256Resize(2000,2000)

it works for JincResize through Jinc144Resize, and Jinc256Resize displays an error message about insufficient memory, as expected. I will try to test on Win7 later.

JincResize_x86_msvc2019_190125.zip

@filler56789

#Jinc144Resize(2000,2000)
#Jinc256Resize(2000,2000)

it works for JincResize through Jinc144Resize, and Jinc256Resize displays an error message about insufficient memory, as expected. I will try to test on Win7 later.

Very strange indeed. I just tested your new DLL, and VirtualDub32, as expected, opened a Jinc256Resize(4096,2304) normally. I also tested with your favorite numbers :-) (2000,2000), and there was no problem at all either.

@filler56789

filler56789 commented Jan 19, 2025

UPDATE 1:
There seems to be a problem with JincResize and ColorBarsHD specifically.
OR the other way around. 🤔
At first I thought it was because of the default colorspace used by ColorBarsHD (YV24), but then a BlankClip(width=1280, height=720, pixel_type="YV24") accepted normally a Jinc256Resize(2000,2000).
P.S.: F.W.I.W., in all my previous tests I used a REAL video clip (a short FULL HD .avi file, DX50 + MP3).

UPDATE 2: The problem is not in ColorBarsHD() itself or its default colorspace YV24, it's the "preferred resizing dimensions" chosen by DTL (2000x2000) — for some obscure reason, JincResize does not like them sometimes.

@Asd-g
Owner

Asd-g commented Jan 20, 2025

Try the attached version with the latest changes.

JincResize.zip

@DTL2020
Collaborator Author

DTL2020 commented Jan 20, 2025 via email

@filler56789

Try the attached version with the latest changes.

JincResize.zip

SCRIPT:
LoadPlugin("C:\AVSplugins32\JincResize-LAST01.dll")
ColorBarsHD()
KillAudio()

TESTS AND RESULTS

Jinc256Resize(2048,1152) ###OK
Jinc256Resize(2000,2000) ###JincResize: failed to allocate memory for coefficient buffer.
Jinc144Resize(3200,1800) ###OK
Jinc256Resize(3200,1800) ###VirtualDub crashed.

Problem signature:
  Problem Event Name:	APPCRASH
  Application Name:	VirtualDub.exe
  Application Version:	1.10.4.0
  Application Timestamp:	526d9abc
  Fault Module Name:	JincResize-LAST01.dll
  Fault Module Version:	2.1.2.0
  Fault Module Timestamp:	678de8c8
  Exception Code:	c0000005
  Exception Offset:	0000853e
  OS Version:	6.3.9600.2.0.0.256.4
  Locale ID:	2057
  Additional Information 1:	1e5d
  Additional Information 2:	1e5d0d2c6fec87d873489f072c940d92
  Additional Information 3:	7565
  Additional Information 4:	7565cb51cbc927e77546b3c03c8cde1e

@DTL2020
Collaborator Author

DTL2020 commented Jan 20, 2025

Try the attached version with the latest changes.

Tested on my old machine with an E7500 CPU, 3 GB of RAM usable from the motherboard, and Win7 x64.

It somehow works even with Jinc256Resize(3200,2000), and AVSmeter reports only 1800 MiB of RAM used. It looks like the self-growing table takes much less RAM than the old estimate in the code? I need to check this in the debugger some time. Startup takes a really long time, about 10..15 seconds, and a full AVSmeter run up to the fps measurement takes about 1 minute. This build works on Win7 after installing the latest C++ redistributables. Also, no crashes have happened so far (though because of the very long startup time, testing many sizes and many kernel sizes takes a lot of time; I will try later at work with a much faster CPU).

@DTL2020
Collaborator Author

DTL2020 commented Jan 20, 2025

Here is an x86 test build that uses a single coefficient table for 4:4:4 and RGB formats (including YV24 from ColorBarsHD()). It is also expected to process the alpha plane of RGBA/YUVA formats (not tested).

For Jinc256Resize(3200,2000) it uses only about 600 MiB of RAM, as AVSmeter reports. The old version uses about 1800 MiB. Plugin init is also visibly faster on the old CPU.

Though it looks like MSVC2019 builds are significantly slower on old SSE 4.1 CPUs?

A pull request with these changes has been opened against the main branch.

JincResize_x86_msvc2019_200125.zip

For UV-subsampled formats 4:2:0 and 4:2:2 some optimization is also possible, but with less benefit in RAM usage (a single coefficient table for the smaller UV planes, but it requires more redesign).
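The sharing idea can be sketched as a cache keyed by plane size. This is an illustration only, not the code from the pull request: `PlaneSize`, `Table` and `tables_for_planes` are hypothetical, and the sketch assumes that planes with the same output size also share the same source size and scale ratio, so that their coefficient tables are identical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

struct PlaneSize {
    int width;
    int height;
};

using Table = std::vector<float>;

// Builds one table per distinct plane size and lets equal-sized planes share it.
// `build` is a placeholder for the real coefficient generation.
inline std::vector<std::shared_ptr<const Table>>
tables_for_planes(const std::vector<PlaneSize>& planes, Table (*build)(int, int))
{
    assert(build != nullptr);
    std::map<std::pair<int, int>, std::shared_ptr<const Table>> cache;
    std::vector<std::shared_ptr<const Table>> result;
    result.reserve(planes.size());
    for (const PlaneSize& p : planes) {
        const std::pair<int, int> key(p.width, p.height);
        auto it = cache.find(key);
        if (it == cache.end())
            it = cache.emplace(key, std::make_shared<Table>(build(p.width, p.height))).first;
        result.push_back(it->second);
    }
    return result;
}
```

For 4:4:4 and RGB all three planes resolve to one table, which is consistent with the drop from about 1800 MiB to about 600 MiB reported above. For 4:2:0 the luma plane keeps its own table and only the two chroma planes share one, so the saving is smaller.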

@filler56789

@Asd-g
Owner

Asd-g commented Jan 21, 2025

With the latest changes the plugin can be built with any compiler (including MinGW GCC).


3 participants