by Brendan O'Connor
Java utilities for statistics/machine learning and various supporting tools. (Often intended for NLP applications, though there's not much NLP in this library.) This needs a better name; currently it's "myutil".
The idea is to be a library of functions for well-known algorithms, as opposed to a grand ML/NLP framework, because those are never as useful as one would hope (in my experience at least).
This is under active development so any of it may be broken at any time. If there are comments with a testing procedure, that may be a good sign.
Math/stats/opt things:
- lots of array/matrix math and manipulation utilities. Unlike Colt or Jama, uses the more natural Java arrays and array-of-arrays representations. Also includes all Java standard library methods, because I can't remember which class is which.
- generic MCMC algorithms: Slice sampling, Metropolis-Hastings
- LibLBFGS: a port of LibLBFGS to Java. Seems to behave similarly to Stanford's OWLQN port, but it's more efficient.
- FastRandom: a random number generator that's 10 times faster than the Java standard library's.
- GaussianInference: conjugate posterior inference (exact and sampling) for Gaussian scalars, linear regression, and DLMs (Kalman filter, smoother, FFBS)
- MVNormal2: linear algebra inference and samplers for multivariate normals (ported from Mallet)
- LNInference: logistic normal MAP and samplers
- discrete chain inference: Viterbi, forward-backward, FFBS
- Online algorithms: Vitter reservoir sampling (ReservoirSampler), and Welford running mean/variance (OnlineNormal1d(Weighted))
- some other math/stats functions
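For a feel of what the online algorithms in the list above do, here is a minimal sketch of Welford's running mean/variance and of basic reservoir sampling (the simple "Algorithm R" form). The class and method names (`OnlineSketch`, `RunningStats`, `Reservoir`) are made up for illustration; they are not the API of `OnlineNormal1d` or `ReservoirSampler`, and which reservoir variant the library actually implements is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative only; hypothetical names, not the library's classes. */
class OnlineSketch {

    /** Welford's running mean and variance over a stream of doubles. */
    static class RunningStats {
        private long n = 0;
        private double mean = 0.0;
        private double m2 = 0.0; // sum of squared deviations from the current mean

        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean); // second factor uses the updated mean
        }

        long count() { return n; }
        double mean() { return mean; }

        /** Unbiased sample variance; NaN when fewer than two values were seen. */
        double variance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }
    }

    /** Reservoir sampling: a uniform sample of up to k items from a stream of unknown length. */
    static class Reservoir<T> {
        private final int k;
        private final Random rng;
        private final List<T> sample = new ArrayList<>();
        private long seen = 0;

        Reservoir(int k, Random rng) {
            this.k = k;
            this.rng = rng;
        }

        void add(T item) {
            seen++;
            if (sample.size() < k) {
                sample.add(item); // fill the reservoir first
            } else {
                // Keep the new item with probability k / seen, evicting a random slot.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) sample.set((int) j, item);
            }
        }

        List<T> sample() { return sample; }
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        Reservoir<Integer> reservoir = new Reservoir<>(3, new Random(42));
        for (int i = 1; i <= 10; i++) {
            stats.add(i);
            reservoir.add(i);
        }
        System.out.println("mean=" + stats.mean() + " variance=" + stats.variance());
        System.out.println("sample=" + reservoir.sample());
    }
}
```

Both run in a single pass with constant memory (plus the k reservoir slots), which is the point of using them on large streams.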
Non-math-y things:
- ThreadUtil: basically ThreadPool wrappers for divide-and-conquer workloads
- printing utilities (mostly)
- BasicFileIO: IO utilities
- Vocabulary: feature name/numberization (I'd love to get a better/more efficient one here)
- Timer: timings for large sections of your program
- JsonUtil: very simple wrappers for Jackson
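To make "feature name/numberization" concrete: the idea is a two-way mapping between feature names and dense integer ids. The sketch below shows one simple way to do that with a `HashMap` and an `ArrayList`. `VocabSketch` and its methods `num`, `name`, and `size` are hypothetical names for this example, not the actual `Vocabulary` API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for illustration; not the library's Vocabulary class. */
class VocabSketch {
    private final Map<String, Integer> nameToId = new HashMap<>();
    private final List<String> idToName = new ArrayList<>();

    /** Returns the id for a name, assigning the next free id the first time it is seen. */
    int num(String name) {
        Integer id = nameToId.get(name);
        if (id == null) {
            id = idToName.size();
            nameToId.put(name, id);
            idToName.add(name);
        }
        return id;
    }

    /** Returns the name that was assigned the given id. */
    String name(int id) {
        return idToName.get(id);
    }

    int size() {
        return idToName.size();
    }

    public static void main(String[] args) {
        VocabSketch vocab = new VocabSketch();
        for (String token : "the cat saw the dog".split(" ")) {
            System.out.println(token + " -> " + vocab.num(token));
        }
        System.out.println("vocabulary size: " + vocab.size());
    }
}
```

Ids are dense and start at zero, so they can index straight into the plain Java arrays used elsewhere in the library. Boxing every id as an `Integer` is one obvious inefficiency of this approach.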
NLP things:
- corenlp/: runners for Stanford CoreNLP that work with JSON or XML-based one-line-per-document formats. Once you have thousands of documents, these formats are typically much faster to deal with than CoreNLP's one-document-per-file strategy. They're more Hadoop-friendly too. To use these, you need to drop the model file (stanford-corenlp-3.2.0-models.jar) into lib/stanford_extras
Example models:
- In the root package, example implementation of CGS LDA. When working on a related model, I copy-and-paste one to get started then hack it up. scripts/ has viewers for it.
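For readers who haven't seen CGS (collapsed Gibbs sampling) for LDA, here is a rough, self-contained sketch of the core sampling sweep. It is not the implementation in this repository: the class name `LdaCgsSketch`, its field names, and the symmetric `alpha`/`beta` priors are all assumptions made for the example.

```java
import java.util.Random;

/** Illustrative collapsed Gibbs sampler for LDA; hypothetical names, not the library's code. */
class LdaCgsSketch {
    final int numTopics, vocabSize;
    final double alpha, beta;  // symmetric Dirichlet priors (an assumption of this sketch)
    final int[][] docs;        // docs[d][i] = word id of token i in document d
    final int[][] z;           // z[d][i] = topic currently assigned to that token
    final int[][] nDocTopic;   // nDocTopic[d][k] = tokens in doc d assigned to topic k
    final int[][] nTopicWord;  // nTopicWord[k][w] = tokens of word w assigned to topic k
    final int[] nTopic;        // nTopic[k] = all tokens assigned to topic k
    final Random rng;

    LdaCgsSketch(int[][] docs, int numTopics, int vocabSize, double alpha, double beta, long seed) {
        this.docs = docs;
        this.numTopics = numTopics;
        this.vocabSize = vocabSize;
        this.alpha = alpha;
        this.beta = beta;
        this.z = new int[docs.length][];
        this.nDocTopic = new int[docs.length][numTopics];
        this.nTopicWord = new int[numTopics][vocabSize];
        this.nTopic = new int[numTopics];
        this.rng = new Random(seed);
        // Start from uniformly random topic assignments.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = rng.nextInt(numTopics);
                update(d, docs[d][i], z[d][i], +1);
            }
        }
    }

    private void update(int d, int w, int k, int delta) {
        nDocTopic[d][k] += delta;
        nTopicWord[k][w] += delta;
        nTopic[k] += delta;
    }

    /** One pass over every token, resampling its topic given all other assignments. */
    void sweep() {
        double[] p = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int w = docs[d][i];
                update(d, w, z[d][i], -1); // remove this token from the counts
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    p[k] = (nDocTopic[d][k] + alpha)
                            * (nTopicWord[k][w] + beta) / (nTopic[k] + vocabSize * beta);
                    total += p[k];
                }
                // Draw a topic from the unnormalized distribution p.
                double u = rng.nextDouble() * total;
                int k = 0;
                double acc = p[0];
                while (u >= acc && k < numTopics - 1) {
                    k++;
                    acc += p[k];
                }
                z[d][i] = k;
                update(d, w, k, +1); // add it back under the new topic
            }
        }
    }

    public static void main(String[] args) {
        int[][] docs = {{0, 1, 2, 0}, {3, 4, 3}, {0, 3, 1, 4, 2}};
        LdaCgsSketch lda = new LdaCgsSketch(docs, 2, 5, 0.5, 0.1, 1L);
        for (int iter = 0; iter < 50; iter++) lda.sweep();
        for (int k = 0; k < lda.numTopics; k++) {
            System.out.println("topic " + k + ": " + lda.nTopic[k] + " tokens");
        }
    }
}
```

The sampler only keeps count tables and the per-token assignments; the topic and document distributions are integrated out, which is what "collapsed" refers to.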
Let's say new code is GPL version 2. Note there's code from other libraries inside here too, like JAMA and LibLBFGS and the Java SDK, which have their own licenses.