EAV Designs for Multi-Tenant Applications

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

EAV Designs for Multi-Tenant Applications

Nic Pottier
Howdy all,

We maintain a hosted multi-tenant system for a large number of users. Each user has what we call an "organization" or "org" which is their sandbox. In that organization they can manage contacts and define custom fields on those contacts, which then have values per contact.

We have thousands of organizations, some organizations have hundreds of custom fields. Some organizations have millions of contacts. We have tens of millions of values across organizations, so relatively sparse with a large number of tiny / unused organizations. All these numbers will continue to grow, though single organizations greater than 10 million contacts is unlikely.

What is the recommended way of modeling this in order to allow performant queries on the custom fields?

Right now we basically use an EAV model (a single contact_fields table) with compound indexes that join the field key and field value so as to force locality in the index. That has worked ok, but occasionally falls on its face when the query planner makes a wrong guess due to the (obviously skewed) statistics it has to work with. Multi-field queries can also get painful, especially on the largest organizations with millions of contacts.

What other approaches should we be looking at? We've brainstormed different approaches but would love some wisdom to help us narrow down what are reasonable things to try testing out. Most of our designs hone in on creating a table per organizations to hold field values and dynamically creating indexes on that. The idea being that we won't kill our insert performance as much by only having one index per field to check on inserts and table statistics should still be ok. (how are statistics managed for JSONB fields?) The main question we have is what is going to happen if we have thousands (or tens of thousands) of tables on a single database? The good news is the vast majority of our organizations are idle at any point in time.

Approaches we've thought about and questions / pros / cons:

1) Add a JSONB field on our (shared across organizations) contact table, store field values there. Create JSONB indexes per unique field. 
   pros: nice having the locality of data on contact, multi field queries are likely way better, query performance should be good
   cons: we have to create thousands of indexes? we have to use uuids as keys to keep our indexes org-specific? insert performance suffers from having thousands of partial indexes (how badly)?

2) Create a table per organization `contact_fields_[org_id]` containing a column per field. Create columns and indexes per unique field.
   pros: locality is nice again, multi field queries are better, query and insert performance should be good.
   cons: thousands of tables with up to 100 indexes per, is that going to blow up?

3) Create a table per organization `contacts_fields_[org_id]` which contains a `fields` JSONB column, Create JSONB indexes per unique field.
  pros: locality is good, multi field queries good, query performance should be good. Adding and removing fields is a bit simpler than 2) case above and naming can be a bit clearer than 1) as we don't have to worry about multi-org key name collisions
  cons: same as 2) but with a JSONB flavor, hundreds of JSONB indexes on thousands of tables, thousands of tables

4) Create a database per organization? Use 1) above
  pros: all data is localized, might see performance improvements elsewhere, query and insert performance should be good
  cons: mother of a refactor :)  what happens with thousands of databases on a single box? is this actually better from the perspective of getting to offload currently inactive orgs?

What other approaches should we be considering? I know EAV is a pain, especially in multi-tenant situations, but we'd love to hear success (and failure) stories from the community on how they've dealt with these.

Cheers,

-Nic





Reply | Threaded
Open this post in threaded view
|

Re: EAV Designs for Multi-Tenant Applications

Alban Hertroys-4

> On 12 Oct 2017, at 16:48, Nic Pottier <[hidden email]> wrote:



> We have thousands of organizations, some organizations have hundreds of custom fields. Some organizations have millions of contacts. We have tens of millions of values across organizations, so relatively sparse with a large number of tiny / unused organizations. All these numbers will continue to grow, though single organizations greater than 10 million contacts is unlikely.
>
> What is the recommended way of modeling this in order to allow performant queries on the custom fields?
>
> Right now we basically use an EAV model (a single contact_fields table) with compound indexes that join the field key and field value so as to force locality in the index. That has worked ok, but occasionally falls on its face when the query planner makes a wrong guess due to the (obviously skewed) statistics it has to work with. Multi-field queries can also get painful, especially on the largest organizations with millions of contacts.
>
> What other approaches should we be looking at? We've brainstormed different approaches but would love some wisdom to help us narrow down what are reasonable things to try testing out. Most of our designs hone in on creating a table per organizations to hold field values and dynamically creating indexes on that. The idea being that we won't kill our insert performance as much by only having one index per field to check on inserts and table statistics should still be ok. (how are statistics managed for JSONB fields?) The main question we have is what is going to happen if we have thousands (or tens of thousands) of tables on a single database? The good news is the vast majority of our organizations are idle at any point in time.

The table per customer approach is probably on the right track. It means you don't have to bother the query planner with fields that are specific to only a few organisations when they don't apply to the currently queried organisation.

I would go one step further and put all the fields common across all organisations into one master table and inherit that in each organisations specific table. If you add a check constraint on the inheriting table's organisation id, I expect that you can even make use of constraint exclusion. That basically turns the master table into a partitioned table, where each partition has it's own fields. I'm not 100% certain that constraint exclusion can work when partitions have different layouts, but I don't see why not - as long as the partitioning (check) constraint is on a field that is also in the master table.

> Approaches we've thought about and questions / pros / cons:
>
> 1) Add a JSONB field on our (shared across organizations) contact table, store field values there. Create JSONB indexes per unique field.
>    pros: nice having the locality of data on contact, multi field queries are likely way better, query performance should be good
>    cons: we have to create thousands of indexes? we have to use uuids as keys to keep our indexes org-specific? insert performance suffers from having thousands of partial indexes (how badly)?
>
> 2) Create a table per organization `contact_fields_[org_id]` containing a column per field. Create columns and indexes per unique field.
>    pros: locality is nice again, multi field queries are better, query and insert performance should be good.
>    cons: thousands of tables with up to 100 indexes per, is that going to blow up?

Perhaps not so much as you think. If an organisation has 100s of fields, they are probably not using a normalised view of their data. Since you're using separate columns now, each column can have its own data type (another con), and that adds the possibility to add references to/from other tables.

For example, if an organisation created fields address1 to address10, you can instead have an address table that references the organisation's specific contact partition, allowing them to add as many addresses as they need. If more organisations use addresses (very likely), you can use the same partitioning approach for each organisation's address table and put a foreign key constraint on each to the organisation's contact partition.
That puts multiple of your original fields under a single index.

I seem to recall that's pretty much how ToroDB goes about organising unstructured data on top of PG, although they probably use a couple more tricks than just that.

In fact, you could already apply this to your current design, although you probably wouldn't gain as much from it.

> What other approaches should we be considering? I know EAV is a pain, especially in multi-tenant situations, but we'd love to hear success (and failure) stories from the community on how they've dealt with these.

I'm not that familiar with JSONB, but wouldn't it be possible to create an index over an organisation-specific JSON 'blob' such that all fields in it are part of the index? I expect that index types aimed at text searches (word searches) would be useful in that respect.

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



--
Sent via pgsql-general mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Previous Thread Next Thread