
180516

  • Pinning the processes to particular cores seems to affect the runtime when using a single thread.
  • The way threading handles permutations, it doesn't report the subset_index, so we don't know which permutations were used in the other threads.

Willard's 3-cycle does achieve log log n on most non-uniform distributions, but it's unsuitable because of the 3-cycle itself and because it relies on sqrt(y_R - y_L), which in our case is ~2^32 >> R - L.

Iterative Methods of Root Finding

The most promising methods seem to be Ridders', Muller's, and inverse quadratic interpolation. Derivative-based methods seem unlikely to work for exponentials. Interpolation-binary seems difficult to make work due to segfault and padding issues; otherwise we have to try using max and min on the bounds, which seems expensive in the core loop.
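
A minimal sketch of how a single inverse-quadratic-interpolation step could be used in the search, fitting index as a function of key through three samples; the function name and signature are illustrative, not the actual implementation:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: one inverse-quadratic-interpolation step for a sorted-array search.
// Given sample indices i0 < i1 < i2 with distinct keys a[i0] < a[i1] < a[i2],
// estimate the index of `target` by interpolating the inverse function
// index(key) through the three samples (Lagrange form, evaluated at target).
inline double iqi_step(const uint64_t* a, size_t i0, size_t i1, size_t i2,
                       uint64_t target) {
  double y0 = (double)a[i0], y1 = (double)a[i1], y2 = (double)a[i2];
  double t  = (double)target;
  double p0 = (double)i0 * (t - y1) * (t - y2) / ((y0 - y1) * (y0 - y2));
  double p1 = (double)i1 * (t - y0) * (t - y2) / ((y1 - y0) * (y1 - y2));
  double p2 = (double)i2 * (t - y0) * (t - y1) / ((y2 - y0) * (y2 - y1));
  return p0 + p1 + p2;  // caller clamps to [i0, i2] and continues the search
}
```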

Householder methods use increasingly higher-order derivatives as part of the search. We might be able to pre-compute derivative approximation functions in order to use these methods.

False position uses a two-point derivative approximation, but there are also higher-order derivative approximations; consider using more points in the specified ratios.
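
A rough sketch contrasting the two-point slope used by false position with a three-point slope obtained by differentiating the Lagrange polynomial through three samples; identifiers are illustrative only:

```cpp
#include <cstddef>
#include <cstdint>

// Two-point (secant / false-position) slope of key as a function of index.
inline double slope2(const uint64_t* a, size_t i0, size_t i1) {
  return ((double)a[i1] - (double)a[i0]) / (double)(i1 - i0);
}

// Three-point slope estimate at i2: derivative of the Lagrange polynomial
// through (i0, a[i0]), (i1, a[i1]), (i2, a[i2]), evaluated at i2.
inline double slope3(const uint64_t* a, size_t i0, size_t i1, size_t i2) {
  double x0 = (double)i0, x1 = (double)i1, x2 = (double)i2;
  double d0 = (x2 - x1) / ((x0 - x1) * (x0 - x2));
  double d1 = (x2 - x0) / ((x1 - x0) * (x1 - x2));
  double d2 = (2 * x2 - x0 - x1) / ((x2 - x0) * (x2 - x1));
  return (double)a[i0] * d0 + (double)a[i1] * d1 + (double)a[i2] * d2;
}

// Newton-style interpolation step using whichever slope estimate is chosen.
inline double interp_step(double i, double key_at_i, double target, double slope) {
  return i + (target - key_at_i) / slope;
}
```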

An exponential function y = ae^(bx) is uniquely determined by two points (three points pin down the more general form y = ae^(bx) + c). Because the exponential changes so much at each increment of x, the small variations that make fitting a uniform distribution difficult are insignificant in comparison, so binary search should find a reasonable approximation. https://math.stackexchange.com/questions/48275/how-can-i-find-the-equation-of-an-exponential-equation-given-a-set-of-points I think this is similar to Ridders' method.
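
A minimal sketch of the two-point fit, assuming positive keys and treating index as x and key as y: fit y = a*e^(bx) through the endpoints of the current range, then invert it to estimate the index of the target. Names are illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Fit y = a * exp(b * x) through (left, keys[left]) and (right, keys[right]),
// then invert it to estimate the index of `target`. Requires positive keys.
inline double exp_interpolate(const uint64_t* keys, size_t left, size_t right,
                              uint64_t target) {
  double y0 = (double)keys[left], y1 = (double)keys[right];
  double b = std::log(y1 / y0) / (double)(right - left);  // growth rate
  double a = y0 / std::exp(b * (double)left);             // scale factor
  return std::log((double)target / a) / b;                // index estimate
}
```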

Also investigate Van Wijngaarden-Dekker-Brent (Brent's method), which I think may have an unnecessary bisection step, and Padé approximants.

Remember that since we decide how the lines are chosen, we can define the cover metric to be unidirectional: a line only covers keys in one direction, even if it could also cover the opposite direction. We could, of course, reverse direction when necessary, but that might require additional code; by staying unidirectional, the linear search needs no modified logic. Under this metric we can fix the number of models used and minimize the distance needed. This can be done with a binary search on the distance, checking feasibility by greedily choosing the models that cover the most keys, up to a constant count.
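
A sketch of that search, under the assumption that a model is a line mapping key to predicted index and that a key at index i is covered when the prediction lands at most d slots before i; all names are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Line { double slope, intercept; };

// Length of the contiguous run of keys starting at `start` that `m` covers
// unidirectionally: the prediction must land at most d slots *before* the
// true index, so the follow-up linear search only ever moves forward.
static size_t run_length(const std::vector<uint64_t>& keys, const Line& m,
                         size_t start, size_t d) {
  size_t i = start;
  while (i < keys.size()) {
    double guess = std::floor(m.slope * (double)keys[i] + m.intercept);
    double err = (double)i - guess;           // positive: predicted too early
    if (err < 0.0 || err > (double)d) break;  // outside the one-sided window
    ++i;
  }
  return i - start;
}

// Greedy feasibility check: at each uncovered position pick the candidate
// line covering the longest run; feasible if at most max_models are used.
static bool feasible(const std::vector<uint64_t>& keys,
                     const std::vector<Line>& candidates,
                     size_t d, size_t max_models) {
  size_t pos = 0, used = 0;
  while (pos < keys.size()) {
    size_t best = 0;
    for (const Line& m : candidates)
      best = std::max(best, run_length(keys, m, pos, d));
    if (best == 0 || ++used > max_models) return false;
    pos += best;
  }
  return true;
}

// Binary search on the distance d for a fixed budget of models.
size_t min_distance(const std::vector<uint64_t>& keys,
                    const std::vector<Line>& candidates, size_t max_models) {
  size_t lo = 0, hi = keys.size();
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (feasible(keys, candidates, mid, max_models)) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}
```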

Graefe

B-tree nodes are organized by keeping the records unsorted and the indirection vector sorted, which impacts the cost of searching; consider searching this data structure. [1] One observation: on our test setup, with thousands of keys, in many settings interpolation search isn't so unbeatable even when the keys come from a uniform distribution. However, for sequential keys, interpolation search should do much better than on a uniform distribution, since no additional linear search is needed. [2] Graefe describes code size both in terms of cache and software development. I'm not sure that software-development code size in a high-level language is related to the code that actually goes into cache (asm).
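
A minimal sketch of the search Graefe describes, with illustrative types: binary search over the sorted indirection vector (slot array), where every comparison pays an extra dereference into the unsorted record area:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record { uint64_t key; uint64_t payload; };

// Returns the slot index of the first record with key >= target.
size_t slot_lower_bound(const std::vector<Record>& records,
                        const std::vector<uint32_t>& slots,  // sorted by key
                        uint64_t target) {
  size_t lo = 0, hi = slots.size();
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (records[slots[mid]].key < target) lo = mid + 1;  // extra indirection per probe
    else hi = mid;
  }
  return lo;
}
```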

180130 - Plan of Action

Peter

  1. Search for a better line to fit the data. The naive algorithm is worst-case n^3, best-case n^2. Current results show improvements, no change, or regressions depending on the set-up conditions. The naive search is also prohibitively expensive for large data sets, so I've been reviewing literature on L1 regression in search of a cheaper approximation.
  2. Pre-compute many models to fit the data and execute them in parallel. Current results show a long tail of lines (~50) needed to cover the whole set with the greedy algorithm. Investigating whether to allow a search of neighbors or to cover only part of the set.
  3. Recursively interpolate using the same slope, adjusting the intercept each time (see the sketch after this list). Current results show improvements at small array sizes and a large decrease in performance at one size transition, which suggests anomalous behavior.
  4. Re-work the interpolation into slope-intercept form. Current results suggest it is worse for integers and mixed for floats.
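
A rough sketch of items 3 and 4 above: the interpolation is kept in slope-intercept form, the slope is computed once from the full array, and each iteration only shifts the intercept so the line passes through the current guess. The guard conditions and final linear search of the real implementation are omitted, and the names are illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-slope interpolation: assumes n >= 2 and a[0] < a[n-1].
size_t fixed_slope_guess(const std::vector<uint64_t>& a, uint64_t target,
                         size_t max_steps = 8) {
  const size_t n = a.size();
  const double slope = (double)(n - 1) / ((double)a[n - 1] - (double)a[0]);
  double estimate = slope * ((double)target - (double)a[0]);  // first guess
  size_t g = 0;
  for (size_t step = 0; step < max_steps; ++step) {
    if (estimate < 0) estimate = 0;
    if (estimate > (double)(n - 1)) estimate = (double)(n - 1);
    g = (size_t)estimate;
    if (a[g] == target) break;
    // Same slope, new intercept: shift the line through (g, a[g]).
    estimate = (double)g + slope * ((double)target - (double)a[g]);
  }
  return g;  // caller finishes with a short linear search around g
}
```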

Yannis

  1. Attempt to parameterize the data distribution and then find the inverse function to perform the interpolation. First investigate Zipf directly, then exponential splines (see the sketch after this list).
  2. Consider more contexts for searching sorted arrays and the application of larger keys. Our discussion of larger keys suggested that under certain distribution assumptions the interpolation arithmetic should not be much worse, while other distribution assumptions make the interpolation arithmetically expensive.
  3. Develop an analytical model to predict the performance of the algorithms.
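
A minimal sketch of item 1 above, assuming the keys roughly follow an exponential distribution: fit the rate parameter from the sample mean (the maximum-likelihood estimate) and use the CDF directly as the interpolation function, so the predicted index is roughly n * F(key). The same pattern applies to any distribution whose CDF or inverse CDF is cheap to evaluate; names are illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ExpModel { double lambda; };

// Maximum-likelihood fit of an exponential distribution: lambda = 1 / mean.
ExpModel fit_exponential(const std::vector<uint64_t>& keys) {
  double sum = 0;
  for (uint64_t k : keys) sum += (double)k;
  return { (double)keys.size() / sum };
}

// Predicted index via the fitted CDF; a short linear search corrects the rest.
size_t predict_index(const ExpModel& m, size_t n, uint64_t key) {
  double cdf = 1.0 - std::exp(-m.lambda * (double)key);  // F(key) in [0, 1)
  size_t guess = (size_t)(cdf * (double)n);
  return guess < n ? guess : n - 1;
}
```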