9/23/22

Data Modelling, why is it important to understand logical and physical...

Coming from an IT consulting organisation with 200K+ consultants, and with at least 25 years of solving "performance issues" after "stuff went into production", I'd say it depends, but usually the projects lacked a data(base) architect during the architectural design and the early design/implementation phase.

I usually say it's a database, not an objectbase. When we model data for a solution we do it in multiple layers. Usually it starts with business objects like Customers, Suppliers, Products, and Orders, and then it comes down to classes that need "storage" through an entity model (yes, any ORM would do, but as long as it's EF without the "Core" I'm happy). This logical data model needs physical support, thus a database. Basically every logical data model object needs a storage object, at the lowest level a Table, which gives us a one-to-one relation between Entity and Table.
And this is how it usually ends. I'm not saying this is right, I'm only saying this is where it usually ends up.

Back to my saying that this is a database, not an objectbase. Databases have data storage objects (tables) and data access objects (procedures and views; I'm leaving out the functions here). The application wants something to feed the Entities from the Tables, and this is where we use views as proxies between the logical and the physical model, as we probably want to remodel the data (this is what most people would call normalizing the data; I use the terms normalize, generalize, and denormalize and care less about which xNF we get. My goal is the most optimal physical data model for performance, data integrity, and size.) to support the R in CRUD (views for -R--). Then we add procedures for the C-UD in CRUD. The basic rule is one table and three procedures per Entity: let us call the view by the EntityName, and the procedures ins-, upd-, and delEntityName, and then add them as methods on the entities (EF supports this, EF Core doesn't!). And then add INSTEAD OF triggers on the views that launch the procedures, for the developer not using the entity model.
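Roughly like this in T-SQL (just a sketch with a made-up Customer entity and made-up columns, to show the shape of the pattern):

```sql
-- Storage object for the made-up Customer entity.
CREATE TABLE dbo.CustomerData
(
    CustomerID int IDENTITY(1,1) NOT NULL CONSTRAINT PK_CustomerData PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    City       nvarchar(100) NULL
);
GO

-- Data access object for the R in CRUD, named after the entity.
CREATE VIEW dbo.Customer
AS
SELECT CustomerID, Name, City
FROM dbo.CustomerData;
GO

-- Data access objects for the C, U, and D in CRUD.
CREATE PROCEDURE dbo.insCustomer @Name nvarchar(100), @City nvarchar(100) = NULL
AS
    INSERT INTO dbo.CustomerData (Name, City) VALUES (@Name, @City);
GO

CREATE PROCEDURE dbo.updCustomer @CustomerID int, @Name nvarchar(100), @City nvarchar(100) = NULL
AS
    UPDATE dbo.CustomerData SET Name = @Name, City = @City WHERE CustomerID = @CustomerID;
GO

CREATE PROCEDURE dbo.delCustomer @CustomerID int
AS
    DELETE FROM dbo.CustomerData WHERE CustomerID = @CustomerID;
GO

-- INSTEAD OF trigger so inserts written straight against the view still go through the procedure.
CREATE TRIGGER dbo.trgCustomerInsert ON dbo.Customer
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @Name nvarchar(100), @City nvarchar(100);
    DECLARE ins CURSOR LOCAL FAST_FORWARD FOR SELECT Name, City FROM inserted;
    OPEN ins;
    FETCH NEXT FROM ins INTO @Name, @City;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC dbo.insCustomer @Name = @Name, @City = @City;
        FETCH NEXT FROM ins INTO @Name, @City;
    END;
    CLOSE ins;
    DEALLOCATE ins;
END;
GO
```

The update and delete triggers follow the same pattern.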

So what is the problem with this approach? Developers cannot change entity models without understanding what happens with the database objects. So if they want to add attributes to an entity after a release, they need to change a table to add a column, change the view to handle the column, change three procedures, and then change the triggers. This would suck big time... To mitigate that problem the developer can create a supporting class for the minor release, think patch. In practical terms: create a new class with the keys and the additional attributes, get a table for that, and run CRUD against that table. Then, when the product reaches a major release, merge them together. But with this in mind, how often does a data model actually change after the release? I'd say not often, small changes, really seldom...
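In T-SQL terms the patch could look something like this (a sketch continuing the Customer example; the LoyaltyLevel column is just a made-up new attribute):

```sql
-- Supporting table for the minor release: the entity key plus the new attribute.
CREATE TABLE dbo.CustomerPatch
(
    CustomerID   int NOT NULL
        CONSTRAINT PK_CustomerPatch PRIMARY KEY
        CONSTRAINT FK_CustomerPatch_CustomerData REFERENCES dbo.CustomerData (CustomerID),
    LoyaltyLevel tinyint NOT NULL CONSTRAINT DF_CustomerPatch_LoyaltyLevel DEFAULT (0)
);
-- The new class runs its CRUD against this table only. At the next major release the
-- column is merged into dbo.CustomerData, and the view, procedures, and triggers are
-- updated once instead of for every small change.
```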

And finally, what are the benefits? The first one people get in mind is performance (I get "premature optimization" thrown in my face every time I bring this up), and we also get data integrity, both in terms of data security and data quality. With security we can make sure that only the right user can reach the right data, all the way down to row/column level, actually all the way down to every single data point we might have. But we also get higher quality in the data, as we have a chance of applying a "higher level of normalization": we can normalize, generalize, and denormalize data with the focus on data, not business objects.
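The security part works exactly because the application only touches the access objects, never the tables. A sketch, assuming the Customer objects from above and a made-up database role:

```sql
-- The application role only sees the access layer; what it can reach is decided in the views.
CREATE ROLE AppUser;
GRANT SELECT  ON OBJECT::dbo.Customer     TO AppUser;  -- the view, not the table
GRANT EXECUTE ON OBJECT::dbo.insCustomer  TO AppUser;
GRANT EXECUTE ON OBJECT::dbo.updCustomer  TO AppUser;
GRANT EXECUTE ON OBJECT::dbo.delCustomer  TO AppUser;
DENY  SELECT  ON OBJECT::dbo.CustomerData TO AppUser;  -- no direct table access
```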

As an example, take a Contacts table containing whatever is needed for sending postcards, covering everyone in the world. We would need attributes like these: ContactID, FirstName, MiddleName, LastName, Address, Zipcode, City, County, State, Country, Continent. See this as a conceptual model; it doesn't need to be perfect...
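As DDL it could look like this (column sizes are my guesses, only there to make the row-size math below concrete; nvarchar means 2 bytes per character, which is where 18 bytes per average string comes from):

```sql
-- The wide, conceptual Contacts table as one single storage object.
CREATE TABLE dbo.Contacts
(
    ContactID  bigint NOT NULL CONSTRAINT PK_Contacts PRIMARY KEY,
    FirstName  nvarchar(50)  NOT NULL,
    MiddleName nvarchar(50)  NULL,
    LastName   nvarchar(50)  NOT NULL,
    Address    nvarchar(100) NOT NULL,
    Zipcode    nvarchar(20)  NOT NULL,
    City       nvarchar(50)  NOT NULL,
    County     nvarchar(50)  NULL,
    [State]    nvarchar(50)  NULL,
    Country    nvarchar(50)  NOT NULL,
    Continent  nvarchar(30)  NOT NULL
);
```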

Let us estimate a logical row at something between 20 and 500 bytes depending on the strings. Say the average string is 9 characters and we go by Unicode, giving us an average of 18 bytes per string, and that we have 10 strings and one key. This needs to support at least 8.000.000.000 individuals and 2.000.000.000 companies, that is 10.000.000.000 rows in the table. SQL Server has a storage unit called a data page that limits how many rows can be fetched through one IO operation. A data page is 8 kB and can store a single physical row of up to 8060 B.
To calculate the rows per page we simply divide 8060 bytes per page by the bytes per row, giving us rows per page. To get the size we divide rows per table by rows per page, giving us pages per table. And as every page is 8 kB we just multiply by 8 and get kB per table.
Legend:
B = Bytes, kB = Kilobytes, r = Rows, p = Page, t = Table, Br = Bytes per Row, Bp = Bytes per Page, kBp = Kilobytes per Page, rp = Rows per Page, pt = Pages per Table, rt = Rows per Table, Bt = Bytes per Table, kBt = Kilobytes per Table
Formulas:
Bp / Br = rp (Round down)
rt / rp = pt (Round up)
pt * kBp = kBt
10.000.000.000 rows per table and 20 Bytes per row:
403 Rows per page, 24.813.896 pages per table, 193.859 MB in the table
10.000.000.000 rows per table and 100 Bytes per row:
80 Rows per page, 125.000.000 pages per table, 976.563 MB in the table
10.000.000.000 rows per table and 500 Bytes per row:
16 Rows per page, 625.000.000 pages per table, 4.882.813 MB in the table
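If you want to play with the numbers yourself, here is the same arithmetic as a one-off query (the 8060 B and 8 kB figures are SQL Server's in-row limits; the rest are the estimates above):

```sql
-- Rows per page, pages per table, and MB per table for the three row-size estimates.
SELECT  BytesPerRow,
        FLOOR(8060.0 / BytesPerRow)                                      AS RowsPerPage,
        CEILING(10000000000.0 / FLOOR(8060.0 / BytesPerRow))             AS PagesPerTable,
        CEILING(10000000000.0 / FLOOR(8060.0 / BytesPerRow)) * 8 / 1024  AS MBPerTable
FROM (VALUES (20), (100), (500)) AS Estimates(BytesPerRow);
```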
Breaking this table up into Contacts, Streets, Cities, and Regions:
Regions(RegionID, State, Country, Region)
Cities(CityID, City, County, RegionID)
Streets(StreetID, Street, Zipcode, CityID)
Contacts(ContactID, Names, StreetID, StreetNumber)
This probably brings it down to about 60% of the size.
This is normalization!
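In DDL the split could look like this (replacing the wide table above; data types are placeholders, Names expanded back into first/middle/last, and the point is the keys and foreign keys):

```sql
-- Each repeating group gets its own table and a surrogate key; Contacts only keeps a reference.
CREATE TABLE dbo.Regions
(
    RegionID int NOT NULL CONSTRAINT PK_Regions PRIMARY KEY,
    [State]  nvarchar(50) NULL,
    Country  nvarchar(50) NOT NULL,
    Region   nvarchar(30) NOT NULL
);

CREATE TABLE dbo.Cities
(
    CityID   int NOT NULL CONSTRAINT PK_Cities PRIMARY KEY,
    City     nvarchar(50) NOT NULL,
    County   nvarchar(50) NULL,
    RegionID int NOT NULL CONSTRAINT FK_Cities_Regions REFERENCES dbo.Regions (RegionID)
);

CREATE TABLE dbo.Streets
(
    StreetID int NOT NULL CONSTRAINT PK_Streets PRIMARY KEY,
    Street   nvarchar(100) NOT NULL,
    Zipcode  nvarchar(20)  NOT NULL,
    CityID   int NOT NULL CONSTRAINT FK_Streets_Cities REFERENCES dbo.Cities (CityID)
);

CREATE TABLE dbo.Contacts
(
    ContactID    bigint NOT NULL CONSTRAINT PK_Contacts PRIMARY KEY,
    FirstName    nvarchar(50) NOT NULL,
    MiddleName   nvarchar(50) NULL,
    LastName     nvarchar(50) NOT NULL,
    StreetID     int NOT NULL CONSTRAINT FK_Contacts_Streets REFERENCES dbo.Streets (StreetID),
    StreetNumber nvarchar(10) NOT NULL
);
```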
There are a lot of strings, so create a string table and assign StringIDs to the strings in the entity tables. Let us do some generalization! I'll add some rough estimates:
Strings(StringID, String), 5.000.000 unique strings, avg row 20 B, total of 97 MB
Regions(RegionID, StateStringID, CountryStringID, RegionStringID), 10.000 states, always 20 B per row, total of 200 kB
Cities(CityID, CityStringID, CountyStringID, RegionID), 250.000 cities, always 20 B per row, total of 5 MB
Streets(StreetID, StreetStringID, Zipcode, CityID), 25.000.000 streets, always 25 B per row, total of 607 MB
Contacts(ContactID, FirstNameStringID, MiddleNameStringID, LastNameStringID, StreetID, StreetNumber), 10.000.000.000 contacts, always 38 B per row, total of 368.514 MB
Adding this together gives us a model of roughly 368 GB. Comparing 976 GB to 368 GB, the new model is less than 38% of the original model.
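Same page arithmetic as before, just applied per table (row counts and row sizes are the rough estimates above):

```sql
-- Estimated size of each table in the generalized model, using the same page math.
SELECT  TableName,
        CEILING(TableRows / FLOOR(8060.0 / BytesPerRow)) * 8 / 1024 AS MBPerTable
FROM (VALUES
        ('Strings',        5000000.0, 20),
        ('Regions',          10000.0, 20),
        ('Cities',          250000.0, 20),
        ('Streets',       25000000.0, 25),
        ('Contacts',   10000000000.0, 38)
     ) AS Estimates(TableName, TableRows, BytesPerRow);
```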
And now we have a really meshy model with a lot of joins, so let us denormalize some…
Strings(StringID, String), this is the same as before, 97 MB.
Addresses(StreetID, StreetStringID, Zipcode, CityStringID, CountyStringID, StateStringID, CountryStringID, RegionStringID), always 42 B per row, total of 1023 MB.
Contacts(ContactID, FirstNameStringID, MiddleNameStringID, LastNameStringID, StreetID, StreetNumber), same as before, 368.514 MB
This final model is probably somewhere in the 35-40% range of the original model, giving us approximately three times as many business objects in the same amount of RAM in the cache on the SQL Server. And we will have a slower, more linear growth in the database when we add content, instead of the much steeper growth of the original wide table.
And then we have the views and procedures making the model work from the business layer.
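For example, the -R-- proxy for Contacts could be a view that puts the flat business object back together from the denormalized Addresses table and the Strings table, so the business layer never sees how the data is actually stored (a sketch over the model above):

```sql
-- Read proxy: rebuilds the flat Contact the entity model expects from the physical model.
CREATE VIEW dbo.Contact
AS
SELECT  c.ContactID,
        fn.String AS FirstName,
        mn.String AS MiddleName,
        ln.String AS LastName,
        st.String AS Street,
        c.StreetNumber,
        a.Zipcode,
        ci.String AS City,
        co.String AS County,
        sa.String AS [State],
        cy.String AS Country,
        re.String AS Region
FROM dbo.Contacts     AS c
JOIN dbo.Addresses    AS a  ON a.StreetID  = c.StreetID
JOIN dbo.Strings      AS fn ON fn.StringID = c.FirstNameStringID
LEFT JOIN dbo.Strings AS mn ON mn.StringID = c.MiddleNameStringID
JOIN dbo.Strings      AS ln ON ln.StringID = c.LastNameStringID
JOIN dbo.Strings      AS st ON st.StringID = a.StreetStringID
JOIN dbo.Strings      AS ci ON ci.StringID = a.CityStringID
JOIN dbo.Strings      AS co ON co.StringID = a.CountyStringID
JOIN dbo.Strings      AS sa ON sa.StringID = a.StateStringID
JOIN dbo.Strings      AS cy ON cy.StringID = a.CountryStringID
JOIN dbo.Strings      AS re ON re.StringID = a.RegionStringID;
```

The ins-, upd-, and delContact procedures then take the flat attributes, look up or add the StringIDs, and write to the underlying tables.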
Long reply, adding to your story. I'm hoping to get developers to understand the importance of the difference between an entity and/or class in the business layer and a table in the database. There is no relation at all between them, as the business layer works with business objects, and in the database we work with data.