Commit b9f2a8b: Talking about indexing safeties and lucene
ayende committed Aug 28, 2014 (1 parent: 3ca1f9f)
Ch06/Ch06.md: 34 additions and 8 deletions

# Inside the RavenDB indexing implementation

In this chapter, we are going to go over a lot of the theoretical details and reasoning behind how RavenDB indexes work. You won't actually learn how to _use_ the indexes in this chapter, but you'll learn all about their internals. Feel free to skip this chapter for now and go straight to the next one for the practical details of using indexes, but come back and read this chapter at your leisure; it contains a lot of very important information about how RavenDB operates internally.

When there isn't a lot of work to do, the batch sizes are going to be small, favoring low latency.

When we have a large number of items to index, we will slowly move to a larger batch size and process more items per batch. This is the high-throughput (but high-latency) indexing strategy that we talked about. When the load goes down, we'll decrease the batch size back to its initial level. This seemingly simple change has dramatic implications for the way RavenDB can handle spikes in traffic. We'll automatically (and transparently) adjust to the new system conditions. That is much nicer than waking your administrators at 3 AM.
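The auto-tuning behavior described above can be sketched in a few lines. This is an illustrative model only, not RavenDB's actual code; the function name and the size constants are made up for the example:

```python
# A minimal sketch of the auto-tuning idea: grow the batch size while the
# backlog keeps us saturated, shrink it back once the load drops.
# Names and constants here are illustrative, not RavenDB's actual values.

INITIAL_BATCH_SIZE = 512
MAX_BATCH_SIZE = 128 * 1024

def next_batch_size(current_size, pending_documents):
    """Return the batch size to use for the next indexing run."""
    if pending_documents > current_size:
        # We could not keep up; trade latency for throughput.
        return min(current_size * 2, MAX_BATCH_SIZE)
    # The backlog fits in one batch; fall back toward low-latency batches.
    return max(current_size // 2, INITIAL_BATCH_SIZE)
```

The point of the sketch is the feedback loop: each batch's size is decided by how far behind the previous batch left us, so a traffic spike ramps the batch size up and a quiet period ramps it back down.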

### Introducing a new index

I mentioned already that in RavenDB, the process of adding a new index is [[TODO]]

### I/O Considerations

I/O is by far the most costly part of the indexing effort, and the hardest to optimize. This is because RavenDB can run on anything from systems with persistent RAM disks (where reading from disk is effectively free) to virtual disks whose data is actually fetched over the network (where I/O latency is _very_ high). Note that this applies to both reads and writes. Indexing needs to read a (potentially large) number of documents so it can actually index them, and it needs to write a (much smaller) amount of data to the index.
Because of that, when loading data from the disk, we are actually limiting ourselves…

Because the prefetcher will fetch the next batch in the background, it is more efficient to hand a smaller batch off for indexing while we are fetching the next batch. Otherwise, we'll spend a lot of time in I/O without using the CPU resources for indexing.
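The prefetching pattern is easy to show in miniature. This is a generic sketch of the technique, not RavenDB's implementation; the function names are invented for the example:

```python
# Illustrative sketch of the prefetching pattern: while one batch is being
# indexed on the CPU, a background worker already reads the next batch from
# disk, so I/O and indexing overlap instead of alternating.
import queue
import threading

def index_with_prefetch(read_batch, index_batch, batch_count):
    """read_batch(i) does the (slow) I/O; index_batch(docs) does the CPU work."""
    batches = queue.Queue(maxsize=1)  # at most one batch waiting in memory

    def prefetcher():
        for i in range(batch_count):
            batches.put(read_batch(i))  # blocks while the indexer is behind
        batches.put(None)               # sentinel: no more batches

    threading.Thread(target=prefetcher, daemon=True).start()
    while (docs := batches.get()) is not None:
        index_batch(docs)
```

The bounded queue is the interesting design choice: it keeps memory use capped to one batch in flight, while still letting the disk read for batch `i + 1` proceed during the CPU work on batch `i`.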

The other side of I/O is writing the data. Usually, we write a lot less data than we read for indexing, so that is a far less troublesome area. But here, too, we have applied optimizations. When writing to the index, we always write to memory first and foremost, and at the end of the index run, we'll _not_ be writing those changes to disk. Going to disk is expensive, so we're trying to avoid it if possible.

So, when do we write to disk? When one of the following happens:

* When the amount of memory used crosses a certain threshold. Currently this is 5 MB^[This is configurable via the Raven/Indexing/FlushIndexToDiskSizeInMb setting], so after the in-memory index hits that size, we'll flush it to disk.
* When there are no more documents to index and there is nothing else to do.
* When a certain time threshold has passed without flushing to disk.

This allows us to only go to the disk for writes when we really have to. In the meantime, we are still able to give you full access to the indexed results directly from memory. However, that does raise an interesting question: if we store things in memory, what happens in the presence of failure?
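The flush rules above amount to a simple decision function. The 5 MB size threshold matches the Raven/Indexing/FlushIndexToDiskSizeInMb default quoted in the text; the time threshold here is an arbitrary illustrative value, and the function itself is a sketch, not RavenDB's code:

```python
# Sketch of the "when do we flush?" rules: size threshold, idle indexer,
# or too long since the last flush.

FLUSH_SIZE_BYTES = 5 * 1024 * 1024   # the configurable 5 MB default
FLUSH_INTERVAL_SECONDS = 5 * 60      # illustrative, not a documented default

def should_flush(in_memory_bytes, pending_documents, seconds_since_flush):
    if in_memory_bytes >= FLUSH_SIZE_BYTES:
        return True   # the in-memory index grew past the size threshold
    if pending_documents == 0:
        return True   # nothing left to index, nothing better to do
    return seconds_since_flush >= FLUSH_INTERVAL_SECONDS
```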

## Index safeties

RavenDB is an ACID database. That means that once you put data into RavenDB, you are assured that the only way to lose it is if you physically take a hammer to the disk drive. The same isn't quite true for indexes. Indexes are updated in the background, and we do a lot of work to ensure that we give you both fast indexing times and fast query times. That means that a lot of the time, we operate on in-memory data.

In other words, as soon as there is a failure, all this data goes away. Well, not really. Remember, the data is only in memory up to a point, at which point it gets saved to disk. So at worst, if we have a hard crash, we lose _some_ indexing data. But we check for this on startup, and that only means that we'll have to re-index the last few documents that haven't been persisted to disk yet. So we are good. Or are we?

RavenDB uses Lucene^[To be rather more exact, we use [Lucene.NET](http://lucenenet.apache.org/).] as its indexing format. That gives us a lot of power, because Lucene is a very powerful library. Unfortunately, it is anything but simple to work with operationally; I'll touch on that in the next section. The important fact is that Lucene doesn't guarantee that its data will be safely flushed to disk, even when it actually does write to disk^[To the database nerds, the difference is the lack of a call to fsync() or its moral equivalent when finishing writing. A crash can still cause the data written to the file to be lost].

RavenDB takes a proactive approach to handle that. On startup, we ensure that the index is healthy, and if needed, we'll reset it to a previous point (or entirely) to make sure that we don't lose data from the index. This is usually only required after a hard machine failure, though. We have run RavenDB through many simulations to make sure that this is the case. In one particular test case, we managed to find a bug after 80 consecutive hard crashes (pulling the power cord from the machine)^[I'll also take this opportunity to thank Tobias Grimm, who was a great help in finding those kinds of issues].
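The recovery idea works because documents are durable and ordered by etag: an index that lost its in-memory tail can rewind to the last etag it persisted and re-index from there. A minimal sketch, with all names invented for illustration:

```python
# Sketch of post-crash index recovery: re-apply every document written
# after the index's last persisted etag. Illustrative only; etags are
# modeled as plain integers here.

def recover_index(index_last_persisted_etag, documents_by_etag, index_fn):
    """Re-index every document whose etag is newer than the persisted point."""
    reindexed = []
    for etag, doc in sorted(documents_by_etag.items()):
        if etag > index_last_persisted_etag:
            index_fn(doc)               # re-run the index over the source doc
            reindexed.append(etag)
    return reindexed
```

Nothing is ever asked of the index's own (possibly lost) in-memory state; the document store is the single source of truth, which is what makes resetting an index "to a previous point (or entirely)" always safe.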

In short, documents stored in RavenDB are guaranteed to always be there, even if you start pulling power cords and crashing systems. Index entries don't have this promise, but we'll fill in any missing pieces (by simply re-running the index over the source data) if something really bad happened. We keep to our respective promises on each side: documents are safe and consistent; indexes are potentially stale and eventually consistent.

## Lucene

I mentioned earlier that we are using Lucene to store our indexes. But what is it? Lucene is a library for creating an inverted index. It is mostly used for full text searching and is the backbone of many search systems that you routinely use. For example, Twitter and Facebook use Lucene, and so does pretty much everyone else. It has gotten to the point that other products in the same area always compare themselves to Lucene.
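To make the term "inverted index" concrete, here is a toy version. Instead of mapping each document to its words, we map each word to the documents containing it, which is exactly what makes full text search fast. (This is a teaching sketch, not how Lucene stores its data on disk.)

```python
# A toy inverted index: term -> sorted list of document ids containing it.

def build_inverted_index(docs):
    """docs: mapping of document id -> text."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):   # one entry per unique term
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, term):
    """Look up every document containing the term; O(1) in document count."""
    return index.get(term.lower(), [])
```

A query never scans the documents themselves; it goes straight from the term to the posting list, so query cost depends on how many documents *match*, not how many exist.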

Now, Lucene has a pretty bad reputation^[It is fairly involved to run, from an operations point of view.], but it is the de facto industry standard for searching. So it isn't surprising that RavenDB is using it, and doing quite well with it. We'll get to the details of how to use Lucene's capabilities in RavenDB in the next chapter; for now, I would like to talk about how we are actually using Lucene in RavenDB.

I mentioned that successfully running Lucene in production is somewhat of a hassle for operations. There are several reasons for this:

* Lucene needs to occasionally compact its files (a process called merging). Controlling how and when this is done is key to achieving good performance when you have a lot of indexing activity.
* Lucene doesn't do any sort of verifiable writes. If the machine crashes midway through a write, you are open to index corruption.
* Lucene doesn't have any facility for an online backup process.
* Optimal indexing and querying speeds depend a lot on the options you use and the exact process with which you work.

All of that requires quite a bit of expertise. We've talked about how RavenDB achieves safety with indexes in the previous section. The other issues are also handled for you by RavenDB. I know that the previous list can make Lucene look scary, but I think that Lucene is a great library, and it is a great solution for handling search.

## Summary
