Distributed Search
User enters search query
Search engine finds relevant feeds.
User can see a list of feeds.
Highly available
Low latency
Low cost
DAU for search is 3M; a single server can handle 1,000 requests.
3M / 1,000 = 3k servers.
Size of a single document: 100 KB
Unique terms per document: 1,000
Index table storage: 100 bytes per term
Storage per document = 100 KB + 1,000 × 100 B = 200 KB
6,000 documents a day = 6,000 × 200 KB = 1.2 × 10^6 KB = 1.2 GB/day
Total page storage: 100B pages × 2 MB/page = 200 PB
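A quick sanity check of the arithmetic above as a minimal Python sketch (every constant is an assumption stated in this section, not a measured value):

```python
# Re-checking the back-of-envelope estimates.
DAU = 3_000_000
REQS_PER_SERVER = 1_000
print(DAU // REQS_PER_SERVER)       # 3000 servers -> "3k"

DOC_SIZE = 100_000                  # 100 KB per document
INDEX_PER_DOC = 1_000 * 100         # 1,000 terms * 100 B -> 100 KB
PER_DOC = DOC_SIZE + INDEX_PER_DOC  # 200 KB per document
print(6_000 * PER_DOC / 1e9)        # 1.2 GB of new data per day

print(100e9 * 2e6 / 1e15)           # 100B pages * 2 MB = 200 PB
```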
Shard by URL.
Use a global index to map a hash onto a URL.
Use an inverted index to map search terms onto feeds.
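A minimal sketch of this routing, assuming the URL's hash doubles as the global document ID (the shard count and hash function are illustrative):

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def url_hash(url: str) -> int:
    """Hash that the global index maps back to the URL."""
    return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest()[:8], "big")

def shard_for_url(url: str) -> int:
    # The same hash decides which shard stores the page.
    return url_hash(url) % NUM_SHARDS

# Global index: hash -> URL.
global_index = {url_hash(u): u for u in ["https://a.example", "https://b.example"]}
```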
Get page by URL
Get page by hash
Search for a word.
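Hypothetical signatures for these three operations (the names and types are assumptions, not a real API):

```python
def get_page_by_url(url: str) -> bytes | None:
    """Fetch the stored page content for a URL."""
    ...

def get_page_by_hash(url_hash: int) -> bytes | None:
    """Fetch a page via the global index (hash -> URL -> shard)."""
    ...

def search(word: str) -> list[str]:
    """Return IDs of feeds whose documents contain the word."""
    ...
```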
Term → list of [doc, freq, location]
doc: identifier of the document
freq: frequency of the term in the document
location: positions of the term in the document
Separate the base-search layer that stores the index (search index on hard drive) and keep it in a k8s StatefulSet -> low latency.
Re-rank with L2 ranking in the midway-search layer. Midway search talks to a Redis cluster for feature data: key = product_id, value = a float array of product features.
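A sketch of that midway-search step, assuming the redis-py client and feature vectors stored as JSON arrays (the host name, serialization, and the dot-product score are all illustrative):

```python
import json
import redis  # redis-py client

r = redis.Redis(host="redis-cluster", port=6379)

def l2_rerank(candidate_ids: list[str], user_vec: list[float]) -> list[str]:
    """Re-order L1 candidates using product feature vectors from Redis."""
    blobs = r.mget(candidate_ids)            # key: product_id, one round trip
    scored = []
    for pid, blob in zip(candidate_ids, blobs):
        feats = json.loads(blob) if blob else []  # value: float array (assumed JSON)
        score = sum(u * f for u, f in zip(user_vec, feats))
        scored.append((score, pid))
    scored.sort(reverse=True)
    return [pid for _, pid in scored]
```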
Build the inverted (Lucene) index: the indexer takes JSON documents as input and releases index segments as output.
An inverted index is a hashmap built on top of the document-term matrix.
key: term, value: list of [doc, freq, location]
doc: a list of documents in which the term appears.
freq: a list counting how often the term appears in each document.
loc: a two-dimensional list that pinpoints the positions of the term in each document.
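A minimal in-memory version of that structure (naive whitespace tokenization, purely illustrative; a real indexer would run analysis, stemming, etc.):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]):
    """term -> {doc: [...], freq: [...], loc: [[...], ...]} from doc_id -> text."""
    index = defaultdict(lambda: {"doc": [], "freq": [], "loc": []})
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, locs in positions.items():
            entry = index[term]
            entry["doc"].append(doc_id)      # documents containing the term
            entry["freq"].append(len(locs))  # term frequency in that document
            entry["loc"].append(locs)        # positions (two-dimensional overall)
    return index

index = build_inverted_index({"d1": "cheap microwave oven",
                              "d2": "microwave oven repair microwave"})
# index["microwave"] -> {"doc": ["d1", "d2"], "freq": [1, 2], "loc": [[1], [0, 3]]}
```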
The MapReduce framework is implemented with the help of a cluster manager and a set of worker nodes acting as mappers and reducers.
The manager initiates the process by assigning a set of partitions to the mappers. Once the mappers are done, the cluster manager assigns their output to the reducers.
Mapper: extracts and filters terms from the partitions assigned to it by the cluster manager. These machines output inverted indexes in parallel, which serve as input to the reducers.
Reducer: combines the mappings for various terms to generate a summarized index.
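A toy single-process sketch of that split; in a real cluster the mappers and reducers run on separate worker nodes and the sort happens in the shuffle phase:

```python
from collections import defaultdict
from itertools import groupby

def mapper(partition):
    """Extract terms from one partition: emit (term, (doc_id, position))."""
    for doc_id, text in partition:
        for pos, term in enumerate(text.lower().split()):
            yield term, (doc_id, pos)

def reducer(term, occurrences):
    """Combine all mappers' output for one term into a posting list."""
    postings = defaultdict(list)
    for doc_id, pos in occurrences:
        postings[doc_id].append(pos)
    return term, dict(postings)

pairs = sorted(mapper([("d1", "to be or not to be")]))  # stand-in for shuffle/sort
index = dict(reducer(term, (occ for _, occ in group))
             for term, group in groupby(pairs, key=lambda kv: kv[0]))
# index["to"] -> {"d1": [0, 4]}
```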
Posting-list traversal, text relevancy score computation, and data extraction from DocValues fields.
Similar to the inverted index, a posting list contains information about each occurrence of the term in the document collection:
The document ID where the term is found.
The position of the term within the document (for phrase and proximity queries).
Additional metadata, such as term frequency.
How does it work?
Query processing: when a search query is received, the search engine breaks it down into individual terms.
Retrieval of posting lists: for each term in the query, the search engine retrieves the corresponding posting list. This involves accessing the index on disk or in memory.
Traversal of posting lists: the engine traverses the lists to identify documents that contain the query terms (see the sketch below).
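A classic instance of the traversal step is the sorted-list intersection used for AND queries (the document IDs here are illustrative):

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Walk two sorted document-ID lists in lockstep; keep docs in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# "cheap microwave" -> intersect the two terms' posting lists
intersect([2, 5, 9, 12], [1, 5, 7, 12, 20])  # -> [5, 12]
```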
How does it work?
Term frequency (TF): the more often a term appears in a document, the more relevant that document is to the term.
Inverse document frequency (IDF): measures how common or rare a term is across all documents in the search index. Terms that appear in many documents are less significant for determining relevancy, so they receive a lower IDF score.
TF-IDF: a combination of TF and IDF. It reduces the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. TF-IDF is one of the most traditional methods for computing text relevancy.
BM25: A more advanced and widely used algorithm in modern search engines. It improves upon the basic TF-IDF model by incorporating probabilistic models and handling limitations like term saturation.
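Hedged sketches of both scoring schemes; the BM25 variant shown is the common Lucene-style formula, and k1 and b are its usual defaults:

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf = in-document frequency, df = docs containing the term."""
    return tf * math.log(n_docs / df)

def bm25(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# TF-IDF grows linearly with tf; BM25 saturates (term saturation):
tf_idf(tf=10, df=100, n_docs=1_000_000)                         # ~92.1
bm25(tf=10, df=100, n_docs=1_000_000, doc_len=120, avg_len=100)
```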
Indexing is the organization and manipulation of data done to facilitate fast and accurate information retrieval.
Makes it possible to improve any layer independently.
Independent layers can also be scaled independently.
The minimum unit of the index is the Lucene segment. There is no way to update a single document in place; the only option is to commit a new index segment containing the new and updated documents. Each new segment affects search latency since it adds computations.
To keep latency low, you have to commit the index less often, which means there is a trade-off between search latency and index update delay. Twitter engineers built a search engine called Earlybird that solves this by searching the not-yet-committed segments in memory.
Elasticsearch employs techniques like merging smaller segments into larger ones, which can improve search performance but also introduces latency during the merge process.
It also reduces the network bandwidth needed to download indices from the indexer to the search layer.
Personalization affects where you cache search results. Previously we cached the search results for the "microwave" query for users in Moscow and reused them for further requests. Now we want a more refined cache: Bob prefers "Toshiba" and Alice prefers "Panasonic" in our ranking.
So we stop caching final results on the search backend and cache at the ranking layer instead.
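One way to read this, as a sketch: keep caching the user-independent candidate set per (query, region) and run personalized ranking per request (the fetch callback and key scheme are assumptions):

```python
import hashlib

candidate_cache: dict[str, list[str]] = {}

def cached_candidates(query: str, region: str, fetch) -> list[str]:
    """Candidates are user-independent, so the (query, region) entry
    (e.g. "microwave" in Moscow) is still shared by every user."""
    key = hashlib.md5(f"{query}|{region}".encode()).hexdigest()
    if key not in candidate_cache:
        candidate_cache[key] = fetch(query, region)  # hits the search backend
    return candidate_cache[key]

# Personalized L2 ranking then runs per request on the shared candidates,
# so Bob and Alice get different orderings and ranked results are not cached.
```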
Lucene uses an LSM tree (log-structured merge tree).
Before segments are flushed or committed, data is stored in memory and is unsearchable.
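A toy model of this write path: writes land in an in-memory buffer and a flush turns the buffer into an immutable segment. By default only flushed segments are searched (as in stock Lucene); passing include_buffer=True mimics the Earlybird-style in-memory search of uncommitted data:

```python
class SegmentedIndex:
    """Toy LSM-style index: buffer in memory, immutable segments after flush."""

    def __init__(self):
        self.segments: list[dict[str, set[str]]] = []  # immutable, committed
        self.buffer: dict[str, set[str]] = {}          # in memory, mutable

    def add(self, doc_id: str, text: str):
        for term in text.lower().split():
            self.buffer.setdefault(term, set()).add(doc_id)

    def flush(self):
        """Commit the buffer as a new immutable segment."""
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = {}

    def search(self, term: str, include_buffer: bool = False) -> set[str]:
        hits = set()
        for seg in self.segments:      # stock Lucene searches only flushed segments
            hits |= seg.get(term, set())
        if include_buffer:             # Earlybird-style: fresh but uncommitted data
            hits |= self.buffer.get(term, set())
        return hits
```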