@@ -1,4 +1,211 @@

# Automatic Indexes & the query optimizer
# The query optimizer and dynamic queries

We started the previous chapter by talking about the following query, and how it _doesn't_ work. In this chapter, we'll spend our time talking about the entire process that makes it possible to run such queries.

    var recentOrdersQuery =
        from order in session.Query<Order>()
        where order.Company == "companies/1"
        select order;

    var recentOrders = recentOrdersQuery.Take(3).ToList();

In RavenDB terms, the recent orders query is actually a dynamic query. We don't explicitly specify what index we want to use; we just run the query as if we had no care in the world, and let RavenDB take care of all the details. But how is this actually done?

The RavenDB Client API translates the above query into the following REST call:

    GET /databases/Northwind/indexes/dynamic/Orders?&query=Company:companies/1&pageSize=3

If we break the call into its component parts, we'll have:

* /databases/Northwind - Using the Northwind database.
* /indexes/dynamic/Orders - Making a dynamic query on the Orders collection.
* ?&query=Company:companies/1 - Where the Company property contains the "companies/1" value.
* &pageSize=3 - Get the first 3 results.

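To make the breakdown above concrete, here is a minimal sketch that assembles the same kind of URL from its parts. `BuildDynamicQueryUrl` is a hypothetical helper written for this illustration, not part of the RavenDB Client API, and it percent-encodes the query clause, so the `:` and `/` characters appear in escaped form rather than raw as in the call shown above.

```csharp
using System;

// Hypothetical helper (not part of the RavenDB client): assembles a
// dynamic-query URL from the parts described in the list above.
string BuildDynamicQueryUrl(string database, string collection,
                            string field, string value, int pageSize)
{
    // "Field:value" clause, escaped so it is safe inside a query string.
    string query = Uri.EscapeDataString(field + ":" + value);
    return $"/databases/{database}/indexes/dynamic/{collection}" +
           $"?query={query}&pageSize={pageSize}";
}

string url = BuildDynamicQueryUrl("Northwind", "Orders", "Company", "companies/1", 3);
Console.WriteLine(url);
```

In real code you would let the Client API build this request for you; the sketch only shows which part of the LINQ query ends up in which part of the URL.
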
A dynamic query like that is going to be sent to the query optimizer for processing.

## The Query Optimizer

The query optimizer is responsible for handling dynamic queries. The first part of its duties is to actually dispatch the query to the appropriate index. Listing 7.1 shows us how we can figure out what it actually did.

```{caption="{Using the query statistics to find the index used for the query}" .cs}
// 'stats' is populated once the query actually executes (on ToList()).
RavenQueryStatistics stats;
IRavenQueryable<Order> recentOrdersQuery =
    from order in session.Query<Order>()
        .Statistics(out stats)
    where order.Company == "companies/1"
    select order;

List<Order> recentOrders = recentOrdersQuery.Take(3).ToList();

Console.WriteLine("Index used was: " + stats.IndexName);
```

When we run the code in Listing 7.1, we will see that the index used was "Orders/Totals", one of the sample indexes in the Northwind database.
How did this happen? We certainly didn't specify that ourselves.

![The Orders/Total index](.\Ch07\Figure01.png)

What happened was that the query optimizer got this query, and then it went over the indexes, looking for all the indexes that index the Orders collection and output the Company property to the index. When it found such an index, it threw a party and then executed the query on that index.

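The matching step described above can be sketched as a simple lookup: find an index on the right collection that outputs every field the query needs. This is a toy model only. The catalog, the field lists, the "Products/ByName" index and the `FindCoveringIndex` function are all assumptions made for illustration; the real optimizer works against the server's index definitions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy catalog: index name -> (collection it covers, fields it outputs).
// The field lists are illustrative assumptions, not read from a real server.
var catalog = new Dictionary<string, (string Collection, string[] Fields)>
{
    ["Orders/Totals"] = ("Orders", new[] { "Employee", "Company", "Total" }),
    ["Products/ByName"] = ("Products", new[] { "Name" }),
};

// Returns the name of an index on the collection that outputs every
// queried field, or null when no existing index can serve the query.
string? FindCoveringIndex(string collection, params string[] queriedFields) =>
    catalog
        .Where(ix => ix.Value.Collection == collection
                  && queriedFields.All(field => ix.Value.Fields.Contains(field)))
        .Select(ix => ix.Key)
        .FirstOrDefault();

Console.WriteLine(FindCoveringIndex("Orders", "Company") ?? "(none)");
Console.WriteLine(FindCoveringIndex("Orders", "Company", "ShipTo_Country") ?? "(none)");
```

The second lookup finds nothing because no index in the toy catalog outputs a `ShipTo_Country` field, which is the "no covering index" situation.
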
> **The query statistics**
>
> The `RavenQueryStatistics` and the `.Statistics(out stats)` call provide a wealth of information about the just executed query. Among the details you can get from the query statistics are:
>
> * Whether the index was stale or not.
> * The duration of the query on the server side.
> * The total number of results (regardless of paging).
> * The name of the index that this query ran against.
> * The last document etag indexed by the index.
> * The timestamp of the last document indexed by the index.
>
> In addition to that, you can use `.ShowTiming()` to get additional detailed information about the execution time of the query on the server.

But what would happen if we executed a query that had no index covering it? For example, what would happen if we ran the code in Listing 7.2?

```{caption="{No index exists to cover this query}" .cs}
RavenQueryStatistics stats;
IRavenQueryable<Order> recentOrdersQuery =
    from order in session.Query<Order>()
        .Statistics(out stats)
    where order.Company == "companies/1"
        && order.ShipTo.Country == "Italy" // the only change from Listing 7.1
    select order;

List<Order> recentOrders = recentOrdersQuery.Take(3).ToList();

Console.WriteLine("Index used was: " + stats.IndexName);
```

Note that the only change between Listing 7.1 and Listing 7.2 is the addition of `&& order.ShipTo.Country == "Italy"` to the query. But because we have this additional property, we can't use any existing index. What will the query optimizer do?

Well, executing this code tells us that the index used is named "Auto/Orders/ByCompanyAndShipTo_Country". And if we look at the studio, it is defined as:

    from doc in docs.Orders
    select new
    {
        ShipTo_Country = doc.ShipTo.Country,
        Company = doc.Company
    }

What just happened? We didn't have this index just a minute ago!

What happened was that the query optimizer got involved. It got the query, which required us to have an index on Orders that indexed both the Company property and the ShipTo.Country property. But there was no such index in existence. At that point, the query optimizer got depressed, tried to drink away its troubles and considered a vacation in Maui when that failed. Coming back from vacation, tanned and much happier, the query optimizer got down to business. We have a query that we have no index for, and RavenDB does not allow the easy and nasty solution of just iterating over all the documents in the database, also known as table scans, also known as 3 AM wakeup calls.

So the query optimizer decided to _create_ such an index. And hence, an index is born. The proud parent watched over the new index, ensuring that it does its job properly, and finally released it to the wild, to roam free and answer any queries that would be directed its way.

> **Ad hoc queries weren't supposed to be there**
>
> When RavenDB was just starting (late 2010), we already had a user base and a really cool database. What we didn't have was ad hoc queries.
> If you wanted to query the database, you had to write an index to do so, and then you had to explicitly specify which index would be used in each query.
> That was a hassle, but there was really no other way around that. We didn't want to do anything that would force a table scan, and there was no other way to support this feature.
>
> Then [Rob Ashton](http://codeofrob.com/) popped into the mailing list and started sending us [crazy bug reports](https://github.com/ayende/ravendb/blob/master/Raven.Tests/Document/Game.cs#L25) with very complex map/reduce indexes^[In general, I found that Rob was very gifted in the art of very quickly breaking any of my code that got near his orbit].
> And he started making silly proposals about dynamic queries, stuff that would _obviously_ not work.
>
> The end result was that I asked him to [show me some code](https://groups.google.com/forum/#!searchin/ravendb/rob$20ashton|sort:date/ravendb/eSyQ8owhQeI/1zaDFRNWWU0J), with the full expectation that I would never hear from him again.
>
> He came back a day later with a functional proof of concept. After I managed to pick my jaw off the floor, I was able to figure out what he was doing and got _very_ excited^[RavenDB is Open Source Software exactly because of moments like that, when someone can come and turn the whole thing around with a new idea.].
>
> Once we had the initial idea, we basically took it up and ran with it. And the result is a very successful feature and this chapter, too.

Leaving aside the anthropomorphism of the query optimizer, what is going on is that the query optimizer reads the query and tries to match a relevant index for it. If it can't find a relevant index, it will create an index that _can_ answer this query. It will then start executing the query against the index. Because indexing can take some time, it will wait until the query has enough results to fill a single page, or 15 seconds have passed (or it completed indexing, of course) before it will return the results to the user^[To do otherwise would _ensure_ that the very first dynamic query of a certain type would always result in no results being returned, which would be... confusing.].

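The waiting rule in the last paragraph (a full page of results, indexing finished, or 15 seconds, whichever comes first) can be sketched as a polling loop. This is only an illustration of the rule as stated: `WaitForResults`, the 10 ms poll interval and the simulated index are assumptions, and the server's actual implementation may look quite different.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Blocks until the index can fill one page, has finished indexing, or the
// timeout elapses, then returns how many results are available.
int WaitForResults(Func<int> resultCount, Func<bool> indexingDone,
                   int pageSize, TimeSpan timeout)
{
    var clock = Stopwatch.StartNew();
    while (resultCount() < pageSize && !indexingDone() && clock.Elapsed < timeout)
        Thread.Sleep(10); // poll interval is arbitrary for this sketch
    return resultCount();
}

// Simulated new index: one more matching document becomes visible every 20 ms.
var started = Stopwatch.StartNew();
int available = WaitForResults(
    resultCount: () => (int)(started.ElapsedMilliseconds / 20),
    indexingDone: () => false,
    pageSize: 3,
    timeout: TimeSpan.FromSeconds(15));

Console.WriteLine("Results available when the query returned: " + available);
```
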
### The life cycle of automatic indexes

Indexes that are created by the query optimizer are named with the `Auto/` prefix. This indicates that they were created by the query optimizer and that they are being _managed_ by the query optimizer as well.

An index in RavenDB is very efficient, but it isn't actually free. When we allow the query optimizer to just create an index on the fly, we also run the risk that so many indexes would be created that it would be a performance hog. In order to handle that, the query optimizer doesn't just create an index and set it free; it actively manages it.

A dynamic index (any index with the `Auto/` prefix, which only the query optimizer should generate) is being tracked for usage. If a dynamic index isn't being used, it will be continuously downgraded and eventually removed. The actual process is a bit involved and contains some heuristics to avoid index churn (creating and deleting auto indexes).

An auto index is created, and from that point, it is business as usual as far as that index is concerned. It gets filled in pretty much the same way as any other index. However, if the index isn't queried, we have to make a decision about what to do with this index. That depends on the index's age. If the index is older than an hour, it means that it had enough queries to hold it in the air for a long period of time, which in turn means that it is very likely that it will be used again.

In that case, the index goes into the reduced resource usage mode, which will first mark it as idle, and eventually as abandoned. But if this is a new index, that means that we probably just tried some query out, or had a one-off administrative query. In that case, we'll retire the index into an idle mode, and after that we'll delete it.

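As a rough sketch, the decision described above can be written as a small state function. It is deliberately simplified: the state names and the one-hour cutoff come from the text, but `NextStateWhenUnused`, the use of plain strings for states, and the absence of the churn-avoiding heuristics are assumptions made for this illustration.

```csharp
using System;

// Decides the next state of an auto index that is no longer being queried.
// States are plain strings here: "Normal", "Idle", "Abandoned", "Deleted".
string NextStateWhenUnused(string state, TimeSpan age)
{
    // An index older than an hour survived long enough to be considered useful.
    bool established = age > TimeSpan.FromHours(1);
    return state switch
    {
        "Normal" => "Idle",                                // first step in both cases
        "Idle" => established ? "Abandoned" : "Deleted",   // old: keep around, new: drop
        _ => state,                                        // nothing further in this sketch
    };
}

Console.WriteLine(NextStateWhenUnused("Idle", TimeSpan.FromMinutes(10)));
Console.WriteLine(NextStateWhenUnused("Idle", TimeSpan.FromHours(3)));
```
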
Over time, this results in a system that has only the indexes that it needs, and usually most of those indexes were created for us by the query optimizer. Another nice feature is that changes in behavior (for example, because of a new release) will result in the database optimizing itself for that behavior.

But the query optimizer still has some tricks to show us. Let us talk about index merging.

### Index merging

A common question that is being asked on the mailing list is: "Is it better to have few fat indexes or more numerous lean indexes?"

The answer to that question is quite emphatically that we want fewer and bigger indexes. A lot of the cost around indexing is in the actual indexing itself. The cost per indexed field is so minor it doesn't usually matter very much at all. That means that we have somewhat of a problem here: we create indexes on the fly, and most of the time they are created to answer a very particular query. Wouldn't that cause a lot of small indexes to be created?

Listing 7.3 shows several such queries run over products, each with a different set of fields it needs.

```{caption="{Running multiple dynamic queries on Products}" .cs}
RavenQueryStatistics stats;

// Query 1: needs only the Discontinued field.
var q = from product in session.Query<Product>()
            .Statistics(out stats)
        where product.Discontinued == false
        select product;
q.ToList();
Console.WriteLine(stats.IndexName);

// Query 2: needs only the Category field.
q = from product in session.Query<Product>()
        .Statistics(out stats)
    where product.Category == "categories/1"
    select product;
q.ToList();
Console.WriteLine(stats.IndexName);

// Query 3: needs both Supplier and Discontinued.
q = from product in session.Query<Product>()
        .Statistics(out stats)
    where product.Supplier == "suppliers/2"
        && product.Discontinued == false
    select product;
q.ToList();
Console.WriteLine(stats.IndexName);
```

The output of the code in Listing 7.3 is very interesting:

    Auto/Products/ByDiscontinued
    Auto/Products/ByCategoryAndDiscontinued
    Auto/Products/ByCategoryAndDiscontinuedAndSupplier

Just by the index names, you can probably guess what is going on. In order to reduce the number of indexes involved, when we create a new dynamic index, and other dynamic indexes for that collection also exist, we will merge them all.

> **What about the old dynamic indexes**
>
> What does the query optimizer do with the Auto/Products/ByDiscontinued index once it creates the Auto/Products/ByCategoryAndDiscontinued index?
> And what does it do with the Auto/Products/ByCategoryAndDiscontinued index once it creates the Auto/Products/ByCategoryAndDiscontinuedAndSupplier index?
>
> The surprising answer is that it doesn't do anything. It doesn't need to. Those indexes are there, and would be cleaned out by the usual query optimizer process once they stop being used.
> Eventually, they will be removed, but there is no rush to do this right now, and that might actually hurt things.

The end result of the code in Listing 7.3 would be a single fat index that can answer all our queries about the documents in a particular collection. That is the optimal result in terms of indexing decisions. But that raises an interesting question: what would happen if we ran the same code again, against the current database with the new automatic indexes?

If you do that, you'll see that the only index being used is: Auto/Products/ByCategoryAndDiscontinuedAndSupplier.

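The merging behavior can be reproduced with a short sketch that tracks only field names. It is a model of the outcome, not of the server's code: `AutoIndexName` is a hypothetical helper, and the alphabetical ordering of fields in the name is an assumption inferred from the index names shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Builds an auto index name from its fields. Sorting the fields
// alphabetically is an assumption inferred from the names in the text.
string AutoIndexName(string collection, IEnumerable<string> fields) =>
    "Auto/" + collection + "/By" +
    string.Join("And", fields.OrderBy(f => f, StringComparer.Ordinal));

// Fields needed by the three queries in Listing 7.3, in order.
var queries = new[]
{
    new[] { "Discontinued" },
    new[] { "Category" },
    new[] { "Supplier", "Discontinued" },
};

var merged = new HashSet<string>(); // fields covered by the widest auto index so far
var names = new List<string>();     // index that served each query
foreach (var queryFields in queries)
{
    // No existing auto index covers the query: merge its fields into a wider one.
    if (!merged.IsSupersetOf(queryFields))
        merged.UnionWith(queryFields);
    names.Add(AutoIndexName("Products", merged));
    Console.WriteLine(names[^1]);
}
```

Running the loop a second time over the same `merged` set would print the widest name three times, since every query is already covered, which mirrors the behavior described in the paragraph above.
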
### Dynamic index selection

When the query optimizer has more than a single choice for the index, it needs to make a selection between those choices. And the choice it makes is usually based on a very simple metric: the width of the index. The wider the index, the more work it does, the easier it is to send it more queries, so we'll favor it.

Behind this decision there is the knowledge that automatic indexes that don't get enough queries will be removed, so just the fact that we aren't directing any queries to an index would end up removing it for us.

However, there is one scenario in which RavenDB will select an index to execute a query even if it isn't the widest index that can serve it. Consider the following case: you have a database with a lot of documents and an existing auto index (such as Auto/Products/ByCategoryAndDiscontinued), and all of a sudden a new query comes by that requires you to create a new index (such as Auto/Products/ByCategoryAndDiscontinuedAndSupplier).

Queries that used to be served by Auto/Products/ByCategoryAndDiscontinued can now be served by Auto/Products/ByCategoryAndDiscontinuedAndSupplier, but there is a problem. Auto/Products/ByCategoryAndDiscontinuedAndSupplier is a new index, and as such, didn't have the chance to go through all the data. If we direct queries to it that can be answered by Auto/Products/ByCategoryAndDiscontinued, we might miss out on information that Auto/Products/ByCategoryAndDiscontinued has already indexed.

Because of that, we also consider how up to date an index is, and we'll prefer the freshest index first, then the widest.

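The "freshest first, then widest" rule amounts to a two-key sort over the candidate indexes. The sketch below assumes each candidate is described by a staleness flag and a field count; the tuple shape and the `Pick` function are illustrative only, and the real optimizer may weigh freshness in a finer-grained way than a single boolean.

```csharp
using System;
using System.Linq;

// Candidate indexes that can all answer the query: (name, is stale?, field count).
var candidates = new[]
{
    (Name: "Auto/Products/ByCategoryAndDiscontinued", IsStale: false, Width: 2),
    (Name: "Auto/Products/ByCategoryAndDiscontinuedAndSupplier", IsStale: true, Width: 3),
};

// Prefer up-to-date indexes first, then the widest one among them.
string Pick((string Name, bool IsStale, int Width)[] options) =>
    options.OrderBy(c => c.IsStale)            // false (fresh) sorts before true (stale)
           .ThenByDescending(c => c.Width)
           .First().Name;

// While the new, wider index is still catching up, the older one is chosen.
Console.WriteLine(Pick(candidates));
```

Once the wider index has caught up (its `IsStale` flag becomes `false`), the same function picks it instead, and the narrower index stops receiving queries.
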
> **Querying the query optimizer**
>
> You can ask the query optimizer what it was thinking when it gave a particular index the chance to run a particular query. You can do that using the following REST call:
>
>     GET /databases/Northwind/indexes/dynamic/Products?query=Category:categories/2&explain=true
>
> * /databases/Northwind - the database we use
> * /indexes/dynamic/Products - dynamic query on the Products collection
> * ?query=Category:categories/2 - querying all products in a particular category
> * &explain=true - explain what you were _thinking_.
>
> This can be helpful if you want to understand a particular decision, although most of the time, those are self-evident.