|
|
Logical view
We're using layered index architecture: each layer adds some features to \ transforms the result of the layer it wraps. Any layer is exposed either an index itself (i.e. it implements IUniqueOrderedIndex or INonUniqueIndex), or exposes a set of such indexes acting nearly as a collection \ provider of them. So probably it's better to think about each layer as of the editable index view of the index below it.
Note: all the transforms happen "on the fly" - i.e. no additional time is required to get the layer itself; moreover, the additional time layer spends on transform is constant per each accessed item.
What is Index?
Index is a set of items ordered by their keys. Quite similar to Dictionary, but with the following differences:
- Index is always ordered, although Dictionary - isn't. So it's possible to get the range of index items (~ from key1 to key2). Getting range enumerator requires logarithmic time; getting a subsequent item in range requires constant time.
- Key is always a part of the item. I.e. there is always a way to extract it from the item (this is usually done by some delegate). Quite frequently such extractor does nothing - it just returns the item itself - finally item is always a Tuple, its first N fields form the key, comparers used by index don't consider other fields at all - so it's safe to return the item as key.
- Indexes supports our IRangeMeasureable interface. This means you can get a measurement for any pre-defined measure (pre-defined - because index must maintain its measurements for each of its page) - not just for the whole index, but for an arbitrary range of its items. The best thing here is that such operation requires logarithmic time, rather than linear (that's what you have in any database now). Certainly, measurement maintenance requires some additional time, but it actually doesn't change the asymptotic time (in O(n) notation) of any index modification operation in general - i.e. Add(TItem item) still requires logarithmic time in both cases.
- Why measures are so important? Actually they allow to bundle OLAP cubes support right into the database. Moreover, they're better than OLAP cubes, since their dimensions don't quantize (i.e. you can get a measure value not for just Q1\2007, Q2\2007, ..., but for any time range).
- Some examples of queries measures may shorten to ~ constant time:
Example:
Select count(*) from ... where X>=@XMin and X<@XMax
Select sum(...), avg(...), min(...), max(...) from ... where X>=@XMin and X<@XMax
etc... |
|
- All above queries will require nearly logarithmic time to complete, if all the mentioned measures are defined for an index indexing X value. Any existing database (at least we didn't found any, which behaves differently) will require a linear time for the same query. Let's show what does this mean:
- We need to fetch ~ 2 pages multiplied by the amount of on-disk transactional layers to accomplish such a query. All the calculations are relatively fast in comparison to page fetch time; let's assume there are 10 transactional layers (good enough estimation even for ~ 1Tb index). So we should fetch just 20 pages to get the query done, or 200ms in the worst case (even if there are 1G of items).
- Traditional database will spend ~ pageSize / itemSize * itemCount / fillFactor time on this. So if itemSize is 100b, the itemCount is 10M and fillFactor is 1 (ok, imagine index is ideal), it will spend at least the time necessary to read 1Gb from the disk. I.e. ~ 10 sec. on rather fast HDDs; if the amount of items will be 1G, it will spend 100 times more time at least (1000 sec.). Why "at least": normally such large indexes are quite fragmented, so even sequential page access requires many seeks. This may easily increase the time by a factor of 10 (ok, you may get 3 hours). Compare this to ours 0.2 sec. in the worst case.
Logical structure layers
Some of our indexes are generic (Index<TKey,TItem>, SortedListIndex) - i.e. they can deal with generally any type of TItem and TKey. Others explicitly require a certain types of them - actually, this is always Tuple (certainly, with arbitrary structure). Some of layers transform their keys and values (i.e. add \ remove some field they need) - normally they don't copy the tuples during these operations, but return TransformedTuple objects handling these transforms on-the-fly.
Ok, let's return back to logical structure now and view following layers.
Physical layer
There are two different index types: Index<TKey,TItem> or SortedListIndex. The second one is used when the amount of items is relatively small (<1000...2000). Both indexes provide the same features, but Index<TKey,TItem> internally uses much more complex approach to get them working, e.g. Measurements are stored in each of its page, but not for the whole index.
Both physical indexes may operate in two modes:
- In-memory mode, read-write access
- In-stream mode, read only access (SortedListIndex just deserializes its content in this mode; Index<TKey,TItem> fetches pages from the disk and utilizes in-memory last-recently-used page cache in this mode).
Physical layer selector (optional)
AutoIndex works here. It automatically chooses the best index implementation on the layer below - Index<TKey,TItem> or SortedListIndex, and switches between them on demand based on the amount of items.
Note: this part isn't implemented now; we're just considering this optimization. For now we're using just Index<TKey,TItem> as physical indexes, although usage of SortedListIndex (it exists, but isn't used here) seems attractive: almost all index changes made by a single transaction are quite small (<100 items), so SortedListIndex usage may act as a good optimization for this common case here.
CompoundIndex (optional)
Compound index is the index merging several indexes having the common part of the key into a single physical index. Basically, it transparently "injects" index identifier right after the common key part.
So if the indexes it exposes have the following structure:
- Index1: long, string
- Index2: long, int, string
- Index3: long, guid
The structure of the physical index will be the following:
- Index: long, int (index identifier), {string | int, string | guid }
Why this index is useful? Imagine that first key value is e.g. user profile key, and the most part of the data we're dealing with is bound to a user profile - i.e. normally it's required to search just inside user's data. Such a transform ensures that user data will be located closely in the underlying physical index - i.e. in a subsequent set of pages, or in a single page. Moreover, if distributed partitioning will occur (ok, some day even this will work), it will never be splat between two different partition servers. Having the data closer to each other ensures it will be accessed much faster: access time mainly depends on the amount of pages we fetch, at least on the primary storage we have now (hard drives) - random page fetch normally implies random read, and the smallest possible time for HDD here is ~ 5ms now, so a single HDD can serve just 100-200 random page fetches per second. CompoundIndex reduces the amount of random requests by using an assumption the data we access is always related to some entity or data type.
Another possible scenario here is efficient date \ time handling for several indexes. Again, we can merge a set of indexes including some date on the first place into a single physical one saying "the data we access is mainly the data related to the same time period".
Why we're thinking such scenarios are common now? The amount of SAAS and web applications is growing quite fast; the amount of users and data they're maintaining is growing even faster. So a generic framework like DataObjects.Net should be fully ready to deal with huge amounts of data. The age of desktop databases with modest sizes has passed - lots of companies try to provide some shared service for the huge amount of users now.
DifferentialIndex
That's what we're using to describe index changes. Differential index is noting more than a wrapper over regular index adding "+" (added), "-" (removed) and "*" (changed) marks against each of its items (they aren't included into the key, of course). Any differential index knows the slice with its own changes, as well as the DifferentialIndex below it (the index to which changes are applied).
We can't change the indexes on disk - once some index (i.e. a slice of differential index) is persisted, it can only be read, and finally - disposed by index garbage collector.
TransactionalIndexSet and TransactionalIndex
TransactionalIndexSet manages index representation for a particular transaction. It allows to:
- Create a TransactionalIndex for the specified Transaction - the changeable index view (TransactionalIndex) you'll get will look exactly like a snapshot isolated index for this transaction should look (i.e. no any changes made after its start will be visible)
- Manage index slice sequence - the chain of the slices forming the index. It provides any part of this chain (from T1 to T2) and supports two chain modify operations: push a new slice on the top of it (that's what happens on transaction commit) and replace two old slices by a new one (that's what background index merge process does). Its garbage collector removes the useless slices (e.g. as the result of merge or rollback), but any slice exists at least for the duration of chain cache period (~ 1 min.) after the time it was actually released from the TransactionalIndexSet chain sequence - this allows TransactionalIndexes to get chain updates much rarely.
- Handle Commits and Rollbacks - by detecting the conflicts on commits and pushing the committed slice into to common chain sequence and releasing the newly created slices (differential indexes) on rollbacks.
- Handle recovery after the failure.
Finally, it logs everything into the TransactionLog it is bound to (redo log), ensures it is flushed on commits and "resurrects" the index state after the last successful commit on recovery.
TransactionalIndex is actually rather simple wrapper over DifferentialIndex - it pushes all the actions made to it into TransactionLog and takes care about the size \ location of the underlying index slice(s). The topmost slice it works with is always in-memory index; but when in reaches a certain limit, it creates a new slice, and serializes the old one to disk in background (it's safe, since old one is already a read-only slice at this point).
NonUniqueIndex
It is a wrapper transforming some unique index into a non-unique one - by rejecting a part of its key.
Note: any relational index is internally described as unique index: if it is primary index, then its key is unique itself; if it is secondary index, it can be exposed as an index with the following key: (NonUniqueKey, ForeignKey). Since ForeignKey is unqiue (as PrimaryKey), the whole key is unique. So NonUnqiueIndex just "strips" the ForeignKey part of the key, and exposes it as non-unique index with just NonUnqiueKey key.
Physical view
Any logical index is actually represented by a stack of physical indexes (slices) laying one over another. Each slice "contains" just the changes made to the logical index state described by a previous slice, and so on - till the "ground" slice. Ground slice contains the changes made to an empty index. Let's try to show this:
Logical index: PK_Customers :
// Visible slices:
- Slice 0 (ground): size: 10Gb, transactions: 1...13352, status: on disk, read-only // All the changed made to an empty index in transactions 1...13352
- Slice 1: size: 4Gb, transactions: 13353...21249, status: on disk, read-only // All the changes made in transactions 13353...21249
- Slice 2: size: 3Gb, transactions: 21250...27498, status: on disk, read-only
...
- Slice 10: size: 100Mb, transactions: ..., status: on disk, read-only
...
- Slice 100: size: 100Kb, transactions: 32453...32460, status: in memory, read-only
- Slice 101: size: 10Kb, transactions: 32461...32462, status: in memory, read-only
- Slice 102: size: 1Kb, transactions: 32464...32464, status: in memory, read-only
// Index merge process slices (not yet visible, but managed):
- Slice 1.M: size: 6Gb, transactions: 13353...27498, status: on disk, serializing // A result of merging Slice 1 & Slice 2
...
- Slice 99.M: size: 150Kb, transactions: ... ...32460, status: in memory, serializing // A result of merging Slice 99 & Slice 100
// Active transaction slices (visible to a particular transaction)
...
- Slice 101.32463: size: 1Kb, transactions: 32463...32463, status: in memory, read-write // All the changes made in transaction 32463, that was started after commit of transaction 32462 (Slice 101), and isn't completed yet
- Slice 102.32465: size: 1Kb, transactions: 32463...32463, status: in memory, read-write // All the changes made in transaction 32465, that was started after commit of transaction 32464 (Slice 102), and isn't completed yet |
|
So that's how everything should look like. Slice names here are just readable examples (real names will be different). You can study what benefits such index representation may give us, and what problems we may get with it comparing this to regular indexing approach (concurrent access to a single B+ tree based index) in Indexing architecture: pros and cons.
See also
|
|