@node Introduction to TokuDB
@chapter Introduction to TokuDB

TokuDB is an embedded transactional database that provides very fast
insertions and range queries on dictionaries. TokuDB provides
transactions with full ACID properties. TokuDB is a library that can be
linked directly into applications, making it an @dfn{embedded} database.
TokuDB provides an API that is very similar to that of Berkeley@tie{}DB.

@node Dictionaries and Associative Memories
@section Dictionaries and Associative Memories

Here we describe what we mean by ``dictionary''.

A @dfn{dictionary} is an ordered set of key-data pairs. It supports the
following operations:

@itemize @bullet
@item
An @dfn{insertion} stores a key-data pair into a dictionary. (One of the
API calls that inserts pairs is called @code{DB->put}.)

@item
A @dfn{point query} finds the data associated with a particular key.
(One of the API calls that executes a point query is called
@code{DB->get}.)

@item
A @dfn{range query} iterates through all the key-data pairs in a range.
For example, in a phone book, you could step through all the people
whose last names are between @samp{Jones} and @samp{Smith}. In the API,
range queries are supported using cursors.
@end itemize

An @dfn{associative memory} is like a dictionary, but it supports only
insertions and point queries, not range queries. Hash tables are often
used to implement associative memories. (In some languages, such as
Python, a ``dictionary'' is really an associative memory.)

@node Asymptotic Performance of Fractal Trees
@section Asymptotic Performance of Fractal Trees

For in-memory dictionaries, trees are often the data structure of
choice. For example, red-black trees or AVL trees implement insertions
and point queries in @math{O(\log N)} time, and range queries in
@math{O(\log N)} time plus time proportional to the number of pairs
returned, where @math{N} is the number of entries in the dictionary
(assuming constant-size key-data pairs).

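As a small in-memory illustration of the three dictionary operations
(insertion, point query, and range query), here is a minimal Python
sketch. It is not the TokuDB or Berkeley@tie{}DB API: the class
@code{SortedDictionary} and its methods are hypothetical names chosen
for this example. It also uses a sorted list with binary search rather
than a red-black or AVL tree, so lookups need a logarithmic number of
comparisons but an insertion may shift elements and take linear time.
Replacing the data when a key already exists is an assumption of the
sketch.

```python
import bisect


class SortedDictionary:
    """Hypothetical in-memory dictionary: two parallel lists sorted by key."""

    def __init__(self):
        self._keys = []
        self._data = []

    def put(self, key, data):
        # Insertion: store a key-data pair, replacing the data if the key exists
        # (an assumption of this sketch, not a statement about any real API).
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._data[i] = data
        else:
            self._keys.insert(i, key)
            self._data.insert(i, data)

    def get(self, key):
        # Point query: binary search for the key; None means "not found".
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._data[i]
        return None

    def range_query(self, low, high):
        # Range query: all pairs with low <= key <= high, in key order.
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return list(zip(self._keys[lo:hi], self._data[lo:hi]))


# Phone-book example with made-up names and numbers.
book = SortedDictionary()
entries = [
    ("Smith", "555-0104"),
    ("Adams", "555-0101"),
    ("Jones", "555-0102"),
    ("Miller", "555-0103"),
    ("Young", "555-0105"),
]
for name, number in entries:
    book.put(name, number)

print(book.get("Jones"))
print(book.range_query("Jones", "Smith"))
```

An associative memory would offer only @code{put} and @code{get}; the
range query is possible here because the keys are kept in sorted order.
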
For disk-resident dictionaries, most systems (such as Berkeley@tie{}DB)
employ B-trees, which can implement insertions and point queries with
@math{O(\log_B N)} disk operations in the worst case, where @math{B} is
the block size of the B-tree (assuming constant-size key-data pairs).

The @dfn{effective block size} of a disk is the size at which
disk-head movement does not dominate the cost of storing or retrieving
data on the disk. The effective block size is difficult to calculate
for actual disk systems, since it depends on several factors: how far
the head must move (the disk head moves short distances faster than long
distances, reducing the effective block size); whether the data is near
the center of the disk (where the bandwidth is low and hence the
effective block size tends to be small) or at the outer edge of the disk
(where the bandwidth is higher); and how much prefetching the operating
system and disk firmware perform.

TokuDB, on the other hand, employs @dfn{fractal trees}, a new class of
data structures that provide the same API as B-trees, but are much
faster. Fractal tree point queries have the same asymptotic cost as
B-trees (@math{O(\log_B N)}), but insertions require only
@math{O((\log N)/B)} disk operations on average, where @math{B} is the
effective block size of the disk system.

How much faster is that in concrete terms? As of 2007, the effective
block size of a disk is about a megabyte. If you use 100-byte key-data
pairs, then you can fit about 10,000 pairs into an effective block. If
you have a 100-terabyte database, then it holds about @math{10^{12}}
pairs, so @math{N=2^{40}} approximately, and on average a fractal tree
requires about @math{(\log N)/B = 40/10000 = 1/250} of a disk transfer
per insertion. That is, on average, you can perform 250 insertions per
disk I/O. In contrast, a traditional B-tree requires
@math{\log_B N = 2} disk I/Os per insertion. In practice, because of
caching in main memory, traditional B-trees require 1 disk read and
1 disk write per insertion.
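
The arithmetic above can be checked with a few lines of Python. This is
only a sketch of the estimate in the text: the variable names are chosen
for this example, a megabyte is taken as @math{10^6} bytes, and the
B-tree cost of 2 disk I/Os per insertion is the figure quoted above, not
something the code derives.

```python
import math

PAIR_BYTES = 100                      # size of one key-data pair
EFFECTIVE_BLOCK_BYTES = 1000 * 1000   # "about a megabyte" (assumed 10**6 bytes)
DATABASE_BYTES = 100 * 10**12         # 100 terabytes

# B: how many pairs fit into one effective block.
pairs_per_block = EFFECTIVE_BLOCK_BYTES // PAIR_BYTES

# N: how many pairs the database holds, and log2(N) rounded to an integer.
num_pairs = DATABASE_BYTES // PAIR_BYTES
log_n = round(math.log2(num_pairs))

# Fractal tree estimate: (log N)/B disk transfers per insertion,
# or equivalently B/(log N) insertions per disk transfer.
fractal_ios_per_insert = log_n / pairs_per_block
inserts_per_io = pairs_per_block / log_n

# B-tree figure quoted in the text: 1 read + 1 write per insertion.
btree_ios_per_insert = 2
io_ratio = btree_ios_per_insert * inserts_per_io

print("B =", pairs_per_block)
print("log2(N) is about", log_n)
print("insertions per disk I/O (fractal tree):", inserts_per_io)
print("B-tree I/Os divided by fractal-tree I/Os:", io_ratio)
```

Under these assumptions the sketch reproduces the 250 insertions per
disk I/O stated above, and the ratio of B-tree I/Os to fractal-tree I/Os
comes out at 500.
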
Thus TokuDB's insertions are orders of magnitude faster than insertions
into dictionaries based on traditional B-trees.

Range queries are also faster in TokuDB than in a typical B-tree. Since
B-trees usually use block sizes that are relatively small compared to
the effective block size of the disk, they require many disk-head
movements to traverse a range. Fractal trees use block sizes that match
the effective block size of the disk, so traversing the same range
requires far fewer disk-head movements.
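
To get a rough feel for the range-query difference, the following Python
sketch counts how many blocks a range scan touches at two block sizes.
The 4096-byte B-tree block is an assumption made for this example (the
text does not give a figure), as is the worst-case simplification that
every block touched costs one disk-head movement. The function name
@code{blocks_touched} is hypothetical.

```python
import math

PAIR_BYTES = 100                      # size of one key-data pair (from the text)
EFFECTIVE_BLOCK_BYTES = 1000 * 1000   # about a megabyte (from the text)
SMALL_BLOCK_BYTES = 4096              # ASSUMED B-tree block size, for illustration


def blocks_touched(num_pairs, block_bytes, pair_bytes=PAIR_BYTES):
    """Number of blocks holding num_pairs consecutive pairs.

    Assumes whole pairs per block and, in the worst case, one disk-head
    movement per block touched.
    """
    pairs_per_block = block_bytes // pair_bytes
    return math.ceil(num_pairs / pairs_per_block)


range_pairs = 1000000  # scan one million consecutive pairs
small_block_reads = blocks_touched(range_pairs, SMALL_BLOCK_BYTES)
large_block_reads = blocks_touched(range_pairs, EFFECTIVE_BLOCK_BYTES)

print("blocks touched with small blocks:", small_block_reads)
print("blocks touched with effective-size blocks:", large_block_reads)
```

In practice, small blocks that happen to be laid out contiguously on
disk need fewer head movements than this worst case suggests, so the
numbers are an upper bound on the gap rather than a measurement.
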