mirror of
				https://github.com/sqlite/sqlite.git
				synced 2025-10-31 18:11:01 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			291 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			291 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| # The JSONB Format
 | |
| 
 | |
| This document describes SQLite's JSONB binary encoding of
 | |
| JSON.
 | |
| 
 | |
| ## 1.0 What Is JSONB?
 | |
| 
 | |
| Beginning with version 3.45.0 (circa 2024-01-01), SQLite supports an
 | |
| alternative binary encoding of JSON which we call "JSONB".  JSONB is
 | |
| a binary format that stored as a BLOB.
 | |
| 
 | |
| The advantage of JSONB over ordinary text RFC 8259 JSON is that JSONB
 | |
| is both slightly smaller (by between 5% and 10% in most cases) and
 | |
| can be processed in less than half the number of CPU cycles.  The built-in
 | |
| [JSON SQL functions] of SQLite can accept either ordinary text JSON
 | |
| or the binary JSONB encoding for any of their JSON inputs.
 | |
| 
 | |
| The "JSONB" name is inspired by [PostgreSQL](https://postgresql.org), but the
 | |
| on-disk format for SQLite's JSONB is not the same as PostgreSQL's.
 | |
| The two formats have the same name, but they have wildly different internal
 | |
| representations and are not in any way binary compatible.
 | |
| 
 | |
| The central idea behind this JSONB specification is that each element
 | |
| begins with a header that includes the size and type of that element.
 | |
| The header takes the place of punctuation such as double-quotes,
 | |
| curly-brackes, square-brackets, commas, and colons.  Since the size
 | |
| and type of each element is contained in its header, the element can
 | |
| be read faster since it is no longer necessary to carefully scan forward
 | |
| looking for the closing delimiter.  The payload of JSONB is the same
 | |
| as for corresponding text JSON.  The same payload bytes occur in the
 | |
| same order.  The only real difference between JSONB and ordinary text
 | |
| JSON is that JSONB includes a binary header on
 | |
| each element and omits delimiter and separator punctuation.
 | |
| 
 | |
| ### 1.1 Internal Use Only
 | |
| 
 | |
| The details of the JSONB are not intended to be visible to application
 | |
| developers.  Application developers should look at JSONB as an opaque BLOB
 | |
| used internally by SQLite.  Nevertheless, we want the format to be backwards
 | |
| compatible across all future versions of SQLite.  To that end, the format
 | |
| is documented by this file in the source tree.  But this file should be
 | |
| used only by SQLite core developers, not by developers of applications
 | |
| that only use SQLite.
 | |
| 
 | |
| ## 2.0 The Purpose Of This Document
 | |
| 
 | |
| JSONB is not intended as an external format to be used by
 | |
| applications.  JSONB is designed for internal use by SQLite only.
 | |
| Programmers do not need to understand the JSONB format in order to
 | |
| use it effectively.
 | |
| Applications should access JSONB only through the [JSON SQL functions],
 | |
| not by looking at individual bytes of the BLOB.
 | |
| 
 | |
| However, JSONB is intended to be portable and backwards compatible
 | |
| for all future versions of SQLite.  In other words, you should not have
 | |
| to export and reimport your SQLite database files when you upgrade to
 | |
| a newer SQLite version.  For that reason, the JSONB format needs to
 | |
| be well-defined.
 | |
| 
 | |
| This document is therefore similar in purpose to the
 | |
| [SQLite database file format] document that describes the on-disk
 | |
| format of an SQLite database file.  Applications are not expected
 | |
| to directly read and write the bits and bytes of SQLite database files.
 | |
| The SQLite database file format is carefully documented so that it
 | |
| can be stable and enduring.  In the same way, the JSONB representation
 | |
| of JSON is documented here so that it too can be stable and enduring,
 | |
| not so that applications can read or writes individual bytes.
 | |
| 
 | |
| ## 3.0 Encoding
 | |
| 
 | |
| JSONB is a direct translation of the underlying text JSON. The difference
 | |
| is that JSONB uses a binary encoding that is faster to parse compared to
 | |
| the detailed syntax of text JSON.
 | |
| 
 | |
| Each JSON element is encoded as a header and a payload.  The header
 | |
| determines type of element (string, numeric, boolean, null, object, or
 | |
| array) and the size of the payload.  The header can be between 1 and
 | |
| 9 bytes in size.  The payload can be any size from zero bytes up to the
 | |
| maximum allowed BLOB size.
 | |
| 
 | |
| ### 3.1 Payload Size
 | |
| 
 | |
| The upper four bits of the first byte of the header determine size of the
 | |
| header and possibly also the size of the payload.
 | |
| If the upper four bits have a value between 0 and 11, then the header is
 | |
| exactly one byte in size and the payload size is determined by those
 | |
| upper four bits.  If the upper four bits have a value between 12 and 15,
 | |
| that means that the total header size is 2, 3, 5, or 9 bytes and the
 | |
| payload size is unsigned big-endian integer that is contained in the
 | |
| subsequent bytes.  The size integer is the one byte that following the
 | |
| initial header byte if the upper four bits
 | |
| are 12, two bytes if the upper bits are 13, four bytes if the upper bits
 | |
| are 14, and eight bytes if the upper bits are 15.  The current design
 | |
| of SQLite does not support BLOB values larger than 2GiB, so the eight-byte
 | |
| variant of the payload size integer will never be used by the current code.
 | |
| The eight-byte payload size integer is included in the specification
 | |
| to allow for future expansion.
 | |
| 
 | |
| The header for an element does *not* need to be in its simplest
 | |
| form.  For example, consider the JSON numeric value "`1`".
 | |
| That element can be encode in five different ways:
 | |
| 
 | |
|   *  `0x13 0x31`
 | |
|   *  `0xc3 0x01 0x31`
 | |
|   *  `0xd3 0x00 0x01 0x31`
 | |
|   *  `0xe3 0x00 0x00 0x00 0x01 0x31`
 | |
|   *  `0xf3 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x31`
 | |
| 
 | |
| The shortest encoding is preferred, of course, and usually happens with
 | |
| primitive elements such as numbers.  However the total size of an array
 | |
| or object might not be known exactly when the header of the element is
 | |
| first generated.  It is convenient to reserve space for the largest
 | |
| possible header and then go back and fill in the correct payload size
 | |
| at the end.  This technique can result in array or object headers that
 | |
| are larger than absolutely necessary.
 | |
| 
 | |
| ### 3.2 Element Type
 | |
| 
 | |
| The least-significant four bits of the first byte of the header (the first
 | |
| byte masked against 0x0f) determine element type.  The following codes are
 | |
| used:
 | |
| 
 | |
| <ol>
 | |
| <li type="0"><p><b>NULL</b> →
 | |
| The element is a JSON "null".  The payload size for a true JSON NULL must
 | |
| must be zero.  Future versions of SQLite might extend the JSONB format
 | |
| with elements that have a zero element type but a non-zero size.  In that
 | |
| way, legacy versions of SQLite will interpret the element as a NULL 
 | |
| for backwards compatibility while newer versions will interpret the
 | |
| element in some other way.
 | |
| 
 | |
| <li value="1"><p><b>TRUE</b> →
 | |
| The element is a JSON "true".  The payload size must be zero for a actual
 | |
| "true" value.  Elements with type 1 and a non-zero payload size are
 | |
| reserved for future expansion.  Legacy implementations that see an element
 | |
| type of 1 with a non-zero payload size should continue to interpret that
 | |
| element as "true" for compatibility.
 | |
| 
 | |
| <li value="2"><p><b>FALSE</b> →
 | |
| The element is a JSON "false".  The payload size must be zero for a actual
 | |
| "false" value.  Elements with type 2 and a non-zero payload size are
 | |
| reserved for future expansion.  Legacy implementations that see an element
 | |
| type of 2 with a non-zero payload size should continue to interpret that
 | |
| element as "false" for compatibility.
 | |
| 
 | |
| <li value="3"><p><b>INT</b> →
 | |
| The element is a JSON integer value in the canonical
 | |
| RFC 8259 format, without extensions.  The payload is the ASCII
 | |
| text representation of that numeric value.
 | |
| 
 | |
| <li value="4"><p><b>INT5</b> →
 | |
| The element is a JSON integer value that is not in the
 | |
| canonical format.   The payload is the ASCII
 | |
| text representation of that numeric value.  Because the payload is in a
 | |
| non-standard format, it will need to be translated when the JSONB is
 | |
| converted into RFC 8259 text JSON.
 | |
| 
 | |
| <li value="5"><p><b>FLOAT</b> →
 | |
| The element is a JSON floating-point value in the canonical
 | |
| RFC 8259 format, without extensions.  The payload is the ASCII
 | |
| text representation of that numeric value.
 | |
| 
 | |
| <li value="6"><p><b>FLOAT5</b> →
 | |
| The element is a JSON floating-point value that is not in the
 | |
| canonical format.   The payload is the ASCII
 | |
| text representation of that numeric value.  Because the payload is in a
 | |
| non-standard format, it will need to be translated when the JSONB is
 | |
| converted into RFC 8259 text JSON.
 | |
| 
 | |
| <li value="7"><p><b>TEXT</b> →
 | |
| The element is a JSON string value that does not contain
 | |
| any escapes nor any characters that need to be escaped for either SQL or
 | |
| JSON.  The payload is the UTF8 text representation of the string value.
 | |
| The payload does <i>not</i> include string delimiters.
 | |
| 
 | |
| <li value="8"><p><b>TEXTJ</b> →
 | |
| The element is a JSON string value that contains
 | |
| RFC 8259 character escapes (such as "<tt>\n</tt>" or "<tt>\u0020</tt>").
 | |
| Those escapes will need to be translated into actual UTF8 if this element
 | |
| is [json_extract|extracted] into SQL.
 | |
| The payload is the UTF8 text representation of the escaped string value.
 | |
| The payload does <i>not</i> include string delimiters.
 | |
| 
 | |
| <li value="9"><p><b>TEXT5</b> →
 | |
| The element is a JSON string value that contains
 | |
| character escapes, including some character escapes that part of JSON5
 | |
| and which are not found in the canonical RFC 8259 spec.
 | |
| Those escapes will need to be translated into standard JSON prior to
 | |
| rendering the JSON as text, or into their actual UTF8 characters if this
 | |
| element is [json_extract|extracted] into SQL.
 | |
| The payload is the UTF8 text representation of the escaped string value.
 | |
| The payload does <i>not</i> include string delimiters.
 | |
| 
 | |
| <li value="10"><p><b>TEXTRAW</b> →
 | |
| The element is a JSON string value that contains
 | |
| UTF8 characters that need to be escaped if this string is rendered into
 | |
| standard JSON text.
 | |
| The payload does <i>not</i> include string delimiters.
 | |
| 
 | |
| <li value="11"><p><b>ARRAY</b> →
 | |
| The element is a JSON array.  The payload contains
 | |
| JSONB elements that comprise values contained within the array.
 | |
| 
 | |
| <li value="12"><p><b>OBJECT</b> →
 | |
| The element is a JSON object.  The payload contains
 | |
| pairs of JSONB elements that comprise entries for the JSON object.
 | |
| The first element in each pair must be a string (types 7 through 10).
 | |
| The second element of each pair may be any types, including nested
 | |
| arrays or objects.
 | |
| 
 | |
| <li value="13"><p><b>RESERVED-13</b> →
 | |
| Reserved for future expansion.  Legacy implements that encounter this
 | |
| element type should raise an error.
 | |
| 
 | |
| <li value="14"><p><b>RESERVED-14</b> →
 | |
| Reserved for future expansion.  Legacy implements that encounter this
 | |
| element type should raise an error.
 | |
| 
 | |
| <li value="15"><p><b>RESERVED-15</b> →
 | |
| Reserved for future expansion.  Legacy implements that encounter this
 | |
| element type should raise an error.
 | |
| </ol>
 | |
| 
 | |
| Element types outside the range of 0 to 12 are reserved for future
 | |
| expansion.  The current implement raises an error if see an element type
 | |
| other than those listed above.  However, future versions of SQLite might
 | |
| use of the three remaining element types to implement indexing or similar
 | |
| optimizations, to speed up lookup against large JSON arrays and/or objects.
 | |
| 
 | |
| ### 3.3 Design Rationale For Element Types
 | |
| 
 | |
| A key goal of JSONB is that it should be quick to translate
 | |
| to and from text JSON and/or be constructed from SQL values.
 | |
| When converting from text into JSONB, we do not want the
 | |
| converter subroutine to burn CPU cycles converting elements
 | |
| values into some standard format which might never be used.
 | |
| Format conversion is "lazy" - it is deferred until actually
 | |
| needed.  This has implications for the JSONB format design:
 | |
| 
 | |
|   1.   Numeric values are stored as text, not a numbers.  The values are
 | |
|        a direct copy of the text JSON values from which they are derived.
 | |
| 
 | |
|   2.   There are multiple element types depending on the details of value
 | |
|        formats.  For example, INT is used for pure RFC-8259 integer
 | |
|        literals and INT5 exists for JSON5 extensions such as hexadecimal
 | |
|        notation.  FLOAT is used for pure RFC-8259 floating point literals
 | |
|        and FLOAT5 is used for JSON5 extensions.  There are four different
 | |
|        representations of strings, depending on where the string came from
 | |
|        and how special characters within the string are escaped.
 | |
| 
 | |
| A second goal of JSONB is that it should be capable of serving as the
 | |
| "parse tree" for JSON when a JSON value is being processed by the
 | |
| various [JSON SQL functions] built into SQLite.  Before JSONB was
 | |
| developed, operations such [json_replace()] and [json_patch()]
 | |
| and similar worked in three stages:
 | |
| 
 | |
| 
 | |
|   1.  Translate the text JSON into a internal format that is
 | |
|       easier to scan and edit.
 | |
|   2.  Perform the requested operation on the JSON.
 | |
|   3.  Translate the internal format back into text.
 | |
| 
 | |
| JSONB seeks to serve as the internal format directly - bypassing
 | |
| the first and third stages of that process.  Since most of the CPU
 | |
| cycles are spent on the first and third stages, that suggests that
 | |
| JSONB processing will be much faster than text JSON processing.
 | |
| 
 | |
| So when processing JSONB, only the second stage of the three-stage
 | |
| process is required.  But when processing text JSON, it is still necessary
 | |
| to do stages one and three.  If JSONB is to be used as the internal
 | |
| binary representation, this is yet another reason to store numeric
 | |
| values as text.  Storing numbers as text minimizes the amount of
 | |
| conversion work needed for stages one and three.  This is also why
 | |
| there are four different representations of text in JSONB.  Different
 | |
| text representations are used for text coming from different sources
 | |
| (RFC-8259 JSON, JSON5, or SQL string values) and conversions only
 | |
| happen if and when they are actually needed.
 | |
| 
 | |
| ### 3.4 Valid JSONB BLOBs
 | |
| 
 | |
| A valid JSONB BLOB consists of a single JSON element.  The element must
 | |
| exactly fill the BLOB.  This one element is often a JSON object or array
 | |
| and those usually contain additional elements as its payload, but the
 | |
| element can be a primitive value such a string, number, boolean, or null.
 | |
| 
 | |
| When the built-in JSON functions are attempting to determine if a BLOB
 | |
| argument is a JSONB or just a random BLOB, they look at the header of
 | |
| the outer element to see that it is well-formed and that the element
 | |
| completely fills the BLOB.  If these conditions are met, then the BLOB
 | |
| is accepted as a JSONB value.
 |