Proximity Service
Functional Requirements
Who are our end users? Two sides: toB (business owners) and toC (consumers).
Serving side and ingestion side.
What is the search radius? What is the maximum radius allowed?
How quickly do we need business information updates to be reflected?
Non-functional Requirements
Highly available
Low latency
Consistency requirements?
Read > write
QPS
Assuming we have 100M users, each making 5 search queries a day:
100M * 5 / ~10^5 seconds per day = 5 * 10^8 / 10^5 = 5,000 QPS
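A quick sanity check of the arithmetic, using the exact 86,400 seconds per day instead of the rounded 10^5:

```python
# Back-of-envelope QPS check (100M users, 5 searches per user per day).
daily_users = 100_000_000
searches_per_user = 5
seconds_per_day = 24 * 60 * 60  # 86,400, rounded to ~10^5 in the estimate above

qps = daily_users * searches_per_user / seconds_per_day
print(f"~{qps:,.0f} QPS")  # ~5,787 QPS, i.e. roughly 5,000 QPS
```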
APIs
Response:
[business1, business2, business3 ...]
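A sketch of what the search API could look like. The path, parameter names, and response fields below are assumptions for illustration, not a finalized contract.

```python
# Hypothetical nearby-search request (all names assumed for illustration):
#
#   GET /v1/search/nearby?latitude=37.7749&longitude=-122.4194&radius=500
#
# and a matching example response body:
example_response = {
    "total": 3,
    "businesses": [
        {"id": "b1", "name": "Cafe A", "distance_m": 120},
        {"id": "b2", "name": "Gym B", "distance_m": 340},
        {"id": "b3", "name": "Bar C", "distance_m": 480},
    ],
}
```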
Data Schema
Design Options
Option 1: Store each business with only its longitude and latitude
Use a plain range query over longitude and latitude to find businesses inside a bounding box around the user. Even with indexes on both columns, the database has to intersect two large result sets, which does not scale well.
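A minimal sketch of the naive query, assuming a business table shaped like the one in the Data Schema section below (SQLite is used here only to keep the snippet runnable):

```python
import sqlite3

def nearby_naive(conn: sqlite3.Connection, lat: float, lon: float, delta: float):
    """Fetch businesses inside a simple bounding box around (lat, lon)."""
    return conn.execute(
        "SELECT id, name FROM business "
        "WHERE latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ?",
        (lat - delta, lat + delta, lon - delta, lon + delta),
    ).fetchall()

# Even with separate indexes on latitude and longitude, the database must intersect
# two large ranges, which is why this option scales poorly.
```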
Option 2: Evenly divided grid
Segment the entire map into a number of evenly divided grids.
Query -> just look up the grid that the user's location falls into (see the sketch below).
Pros:
More efficient than option 1.
Cons:
The number of businesses per grid can be very uneven.
If the user zooms in or out, a fixed grid is not flexible enough to show businesses at different zoom levels.
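A toy sketch of the fixed-grid idea; the cell size is an assumption (a real system would pick it based on the supported search radii):

```python
CELL_SIZE_DEG = 0.01  # assumed fixed cell size, roughly 1 km of latitude per cell

def grid_id(lat: float, lon: float) -> tuple[int, int]:
    """Identify the evenly divided grid cell that a point falls into."""
    return (int(lat // CELL_SIZE_DEG), int(lon // CELL_SIZE_DEG))

# A query looks up the user's cell (and typically its 8 neighbors), but every cell
# has the same size no matter how many businesses it contains.
```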
Option 3: Geohashing
Reduce the two-dimensional longitude and latitude data to a one-dimensional string of letters and digits by recursively dividing the world into smaller and smaller grids with each additional bit (sketched below).
Pros:
Very efficient, and the precision can be chosen to fit different radius/zoom use cases.
Cons:
Not entirely straightforward to implement from scratch, but luckily there are plenty of out-of-the-box libraries/solutions.
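A minimal geohash encoder, included only to show how longitude and latitude bits are interleaved into a base32 string; in practice we would use an existing library rather than this hand-rolled sketch.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a coordinate into a geohash string of the given length."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, even, bit_count, ch = [], True, 0, 0
    while len(result) < precision:
        if even:  # even bits refine longitude
            mid = (lon_range[0] + lon_range[1]) / 2
            if lon >= mid:
                ch, lon_range[0] = (ch << 1) | 1, mid
            else:
                ch, lon_range[1] = ch << 1, mid
        else:  # odd bits refine latitude
            mid = (lat_range[0] + lat_range[1]) / 2
            if lat >= mid:
                ch, lat_range[0] = (ch << 1) | 1, mid
            else:
                ch, lat_range[1] = ch << 1, mid
        even, bit_count = not even, bit_count + 1
        if bit_count == 5:  # every 5 bits become one base32 character
            result.append(BASE32[ch])
            bit_count, ch = 0, 0
    return "".join(result)

print(geohash_encode(37.7749, -122.4194, precision=6))  # "9q8yyk" (San Francisco)
```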
Option 4: Quadtree
Build an in-memory quadtree by partitioning the two-dimensional space, recursively subdividing it into four quadrants until the contents of each grid meet a certain criterion, for example a maximum of 100 businesses.
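A sketch of the quadtree build, assuming a Business record with lat/lon fields and the 100-business threshold mentioned above:

```python
from dataclasses import dataclass, field

MAX_BUSINESSES_PER_NODE = 100  # threshold from the text above

@dataclass
class Business:
    id: str
    lat: float
    lon: float

@dataclass
class QuadtreeNode:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float
    businesses: list = field(default_factory=list)
    children: list = field(default_factory=list)  # empty for leaf nodes

    def insert(self, b: Business) -> None:
        if self.children:                 # internal node: route to the right child
            self._child_for(b).insert(b)
            return
        self.businesses.append(b)
        if len(self.businesses) > MAX_BUSINESSES_PER_NODE:
            self._subdivide()

    def _subdivide(self) -> None:
        """Split this leaf into four quadrants and redistribute its businesses."""
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lon = (self.min_lon + self.max_lon) / 2
        self.children = [
            QuadtreeNode(self.min_lat, mid_lat, self.min_lon, mid_lon),
            QuadtreeNode(self.min_lat, mid_lat, mid_lon, self.max_lon),
            QuadtreeNode(mid_lat, self.max_lat, self.min_lon, mid_lon),
            QuadtreeNode(mid_lat, self.max_lat, mid_lon, self.max_lon),
        ]
        pending, self.businesses = self.businesses, []
        for b in pending:
            self._child_for(b).insert(b)

    def _child_for(self, b: Business) -> "QuadtreeNode":
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lon = (self.min_lon + self.max_lon) / 2
        index = (2 if b.lat >= mid_lat else 0) + (1 if b.lon >= mid_lon else 0)
        return self.children[index]

# Usage: root = QuadtreeNode(-90, 90, -180, 180); root.insert(Business("b1", 37.77, -122.42))
```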
Geohash vs Quadtree
Geohash: easy to use and implement; no need to build a tree.
Quadtree: need to build a tree; harder to implement.
Geohash: grid size is fixed. Supports returning businesses within a specific radius but not k-nearest businesses.
Quadtree: good fit for k-nearest businesses because it can automatically widen the query range until it returns k results.
Geohash: precision and grid size are fixed; cannot adjust the grid size based on item density.
Quadtree: dynamically adjusts the grid size based on population density.
Geohash: updating/removing a business is as easy as deleting that geohash record.
Quadtree: updating the index is more complicated than geohash. If a business is removed, we need to traverse from the root to the leaf node to remove it. A locking mechanism is also required if multiple threads modify the tree. We also need to think about rebalancing the tree; a possible fix is to over-allocate the ranges.
Business Table
id: string
name: string
longitude: float
latitude: float
geohash: string
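The same schema expressed as DDL, run through SQLite here only to keep it executable; the index on geohash is an assumption, added because the serving algorithm below fetches businesses by geohash prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (
        id        TEXT PRIMARY KEY,
        name      TEXT,
        longitude REAL,
        latitude  REAL,
        geohash   TEXT
    );
    -- assumed index so prefix queries on geohash are cheap
    CREATE INDEX idx_business_geohash ON business (geohash);
""")
```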
Serving Algorithm
Convert the user's location to a geohash, with the precision chosen based on the search radius.
Starting from the user's geohash, calculate the neighboring geohashes and add them, together with the user's own geohash, to a list.
For each geohash in the list, fetch businesses whose geohash starts with that prefix.
Filter the results by calculating the distance between each business and the user's location, keeping only businesses within the search radius.
Rank the result list and return it to the client.
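A sketch tying the steps together. The geohash encode/neighbors helpers and fetch_businesses_by_prefix are passed in as placeholders for a real geohash library and the data-access layer; the radius-to-precision mapping matches the caching section below.

```python
from math import asin, cos, radians, sin, sqrt

RADIUS_TO_PRECISION = {500: 6, 1000: 5, 2000: 5, 5000: 4}  # meters -> geohash length

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def search_nearby(lat, lon, radius_m, encode, neighbors, fetch_businesses_by_prefix):
    precision = RADIUS_TO_PRECISION[radius_m]              # step 1: radius -> precision
    center = encode(lat, lon, precision)
    prefixes = [center] + list(neighbors(center))          # step 2: self + neighbors
    candidates = []
    for prefix in prefixes:                                # step 3: fetch by prefix
        candidates.extend(fetch_businesses_by_prefix(prefix))
    within = [b for b in candidates                        # step 4: distance filter
              if haversine_m(lat, lon, b.lat, b.lon) <= radius_m]
    return sorted(within,                                  # step 5: rank (by distance here)
                  key=lambda b: haversine_m(lat, lon, b.lat, b.lon))
```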
High Level Diagram
Caching
Caching is not a solid win because:
The workload is read-heavy and the dataset is relatively small (~1.7GB), so it fits in the working set of any modern database server; the queries are not I/O bound and should run almost as fast as an in-memory cache.
If reads become the bottleneck, we can add read replicas to improve read throughput.
Cache key selection
Location coordinates (latitude, longitude).
Cons:
the location returned by the device is not always accurate and changes slightly every time
the user can move
as a result, the hit rate is terrible if we key on raw coordinates
Geohash and business id
geohash -> a list of business ids
business id -> business entity
According to the requirements, the user can select one of four radii: 500m, 1km, 2km, and 5km. Those radii map to geohash lengths 6, 5, 5, and 4 respectively. We can cache data per precision, with keys like geohash_4, geohash_5, and geohash_6.
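A read-through cache sketch with Redis. The key format (geohash_<length>:<hash>), the TTL, and fetch_business_ids_from_db are assumptions standing in for the real key scheme and database query.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 60 * 60  # assumed TTL; business data changes infrequently

def get_business_ids(geohash: str, fetch_business_ids_from_db) -> list:
    """Return business ids for a geohash cell, reading through the cache."""
    key = f"geohash_{len(geohash)}:{geohash}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    ids = fetch_business_ids_from_db(geohash)  # cache miss: hit the database
    r.set(key, json.dumps(ids), ex=CACHE_TTL_SECONDS)
    return ids
```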
Memory
Redis storage: 8 bytes per business id x 200M businesses x 3 precisions ≈ 5GB