In the real world, relationships matter: users have the orders, orders have line items, and directories have files and subdirectories.
Relational databases are specifically designed and this will not come as a surprise to you to manage relationships. Indexed search engines (ElasticSearch, AzureSearch with which Virto has the integrations) instead, like most NoSQL databases, treats the world as though it were flat. An index is a flat collection of independent documents. A single document should contain all of the information that is required to decide whether it matches a search request.
Both Relational and Flat worlds have advantages and disadvantages and we need to use them together and somehow, we need to bridge the gap between these two worlds.
In this article, we try to show the two different technics that we used in the Virto platform to address the problems with relationships between Relational and Flat worlds.
The problem description
Let’s imagine that we have a real requirement, need to provide for end-users the availability for full-text search among products that they once bought.
Application infrastructure looks like discribed on this diagram.
All information for user orders is stored in the relational database while products are stored in index (ElasticSearch or AzureSearch engines)
Let’s consider the two solutions, how we can enable the user to do full-text search among the products that he once bought.
Solution #1 - Denormalized data. Store with the product index the array of users identifiers that ever bought that product
To search within products that ever bought by a user called User1 , we would need to pass user IDs in a query
The main advantage of this method is that all data that can be used for filtration are stored in the index with main data and you might use a simple query to filter products.
The big disadvantage is the complexity of the indexing process, you need to update each product document in the index for any new order creation with an array of all user IDs that have ever bought this product. To do this, you must load all the orders that contain this product in order to get an array of user IDs for storing in the index.
Solution #2 - Application-side join. Run extra queries in SQL in order to use resulting data for products filtration in index queries.
We can (partly) emulate a relational database by implementing joins in our application.
To indexed search within products that ever bought by user has Id User1 , we would need to run two queries: the first would look up all products that user User1 had ever bought in order to find their IDs, and the second would pass those IDs in a query similar to the preceding one:
The main advantage of this method is that we don’t need to store the data for relationships in the search index and keep it on the consistent state with user orders.
The disadvantage is that you have to run extra queries in order to join documents at search time.
In this example, there were only a few products matched our first query, but in the real world we could easily have thousand of products that given user ever bought. Including all of their IDs in the second query would make for a very large query, and one that has to do thousand of term lookups.
In addition, some search engines have some limitations for the size of such terms queries
but there are some workarounds to overcome these limitations: