-
Elastic Search forms with index
-
Index (Logical namespace / DB Name)
- Divided into Shrads (Types / Table / Schema)
-
Each shard has replica
-
Each shard is Lucene index
- For inverted search
-
Divided into Segments
- Inverted index
-
Segments are immutable
-
Apache Lucene
- Powerful open-source full-text search library
-
DB normalize
- Case disposal, punctuation
- Common word remove
- Stop word list
- Lemma
- Mapping different usage to one
-
Tokenization
- Stemming > Buying > Buy
-
After 1 second index refresh happens
- Doc indexed
-
Translog (30 min or 512mb)
-
Master node and Data Node available
-
Inverted index
- A document is the unit of data in Elasticsearch and an inverted index is created by tokenizing the terms in the document, creating a sorted list of all unique terms, and associating a list of documents with where the word can be found.
- Record level inverted index (word -> record)
- Word level inverted index (word -> record, position )
- A document is the unit of data in Elasticsearch and an inverted index is created by tokenizing the terms in the document, creating a sorted list of all unique terms, and associating a list of documents with where the word can be found.
- Porters stemmer algo
- The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English.
- Its main use is as part of a term normalization process that is usually done when setting up Information Retrieval systems.
- Murmur3 hash function
- Why indexing?
- Database blocks stored like a linked list
- For searching N blocks
- when not sorted searching requires
N/2
access avg - when duplicates then
N
- when sorted
logN
- when not sorted searching requires
-
Creating an index on a field in a table creates another data structure that holds the field value and a pointer to the record it relates to.
-
This index structure is then sorted, allowing Binary Searches to be performed on it.
-
Given our sample database of
r = 5,000,000
records of a fixed size giving a record length ofR = 204
bytes and they are stored in a table using the MyISAM engine which is using the default block size B =1,024
bytes. -
The blocking factor of the table would be
bfr = (B/R) = 1024/204 = 5
records per disk block. -
The total number of blocks required to hold the table is
N = (r/bfr) = 5000000/5 = 1,000,000
blocks. -
High cardinality
- Better indexing
-
Type of index
- Primary index
- Dense index
- All records map to index
- Sparse index
- Few records map to sorted index
- Sparse index is not possible if records are not organized/sorted
-
Graph database which considers relation and data secondary index
- Low cardinality in the database
- Common attribute
- Easy for indexing
- Made efficient through bitmap indexing
- DB Normalization - Denormalization
- A list of words contained in a document. (Forward index)
- A list of documents containing a word. (Inverted index)