tl;dr
This started as a simple question and grew complex quickly. Feel free to skip the analysis and go straight to Takeaways.
If even that’s too much: yes, JSON is viable, but SQL is still SQL. Index your data.
Background
Years ago, I worked on a request management system with my good friend and brilliant developer, Alan Samet. For auditing/archival purposes, the system had the common SQL pattern of using a main and a history table per object type. This got tedious quickly.
Our release cadence was measured in hours and the double-work had a huge impact. After a particularly heavy release, Alan declared that he was done with having to duplicate every database-related task and went off to build a prototype object store, housing all our objects in a single table that would never need schema updates.
It was a smashing success.
Using XML to store our objects was one of the best design decisions we ever made. Our lookups were almost exclusively key-based, making a single store for all types performant and simple. Schema updates were handled gracefully with de/serialization logic and adding new types was essentially automatic.
I still love that object store, but there’s one little niggle. I don’t like XML. It’s verbose, not terribly readable, and querying with XPath expressions is extra effort. I want to use JSON instead, but is it viable?
I created a simple test project to compare JSON performance against a traditional relational implementation. The source code can be found here.
Approach
For testing purposes, I created the following schema:
type Address = {
    Id: Guid
    Line1: string
    Line2: string
    City: string
    State: string
    Zip: string
}

type PhoneNumber = {
    Id: Guid
    Country: string
    AreaCode: string
    Number: string
}

type Person = {
    Id: Guid
    FirstName: string
    LastName: string
    Age: int
    Address: Address
    PhoneNumbers: PhoneNumber list
}
Person is the root object, with a child object (Address) and a set of child objects (PhoneNumbers). The Address object serves our 1 to 1 testing; PhoneNumbers serves 1 to many.
I tested 5 storage patterns[1] for JSON in SQL Server across 3 major scenarios: INSERT performance, SELECT performance for 1 to 1 relationships (Person.Address.Zip), and SELECT performance for 1 to many relationships (Person.PhoneNumbers[*].Country)[2].
The storage patterns are:
RawJson: A table with a single NVARCHAR(MAX) column for storing JSON. No indices or keys.
RawJson500: Exactly the same as RawJson, except that the column width is limited to 500.
JsonWithIndex: A table with an NVARCHAR(MAX) column for storing JSON data, and a computed/persisted column extracting a single 1 to 1 child property. The table has a single index covering only the computed column.
JsonWithDimensionTable: A combination of two tables, borrowing from a common data-warehouse pattern. The first table has two columns: an NVARCHAR(MAX) column to store the data and a computed/persisted column storing the Id of the root object and serving as the UNIQUE CLUSTERED PK. The second table contains two columns: a UNIQUEIDENTIFIER column with the Id of the root object and an NVARCHAR(5) column containing a 1 to many child property.
The second table effectively serves as an index for the primary table. We minimize filtering costs by querying a subset of the data and use the identified keys to perform targeted selection of the primary table. The data in the second (dimension) table are populated with an INSERT TRIGGER. Rough DDL sketches of the JSON patterns follow this list.
Relational: A combination of three tables, this follows the most common pattern. Each object is stored in its own table with PK/FK relationships and the filtering columns are indexed.
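To make the JSON patterns concrete, here’s a minimal DDL sketch. The table, column, and index names are mine rather than the project’s, and the extracted properties follow the test schema above; treat it as an illustration of the patterns, not the actual implementation.

-- RawJson / RawJson500: a bare JSON column, no keys or indices
CREATE TABLE dbo.RawJson (
    Data NVARCHAR(MAX) NOT NULL      -- RawJson500 is identical except NVARCHAR(500)
);

-- JsonWithIndex: a persisted computed column extracts one 1 to 1 property,
-- and a nonclustered index covers only that column
CREATE TABLE dbo.JsonWithIndex (
    Data NVARCHAR(MAX) NOT NULL,
    AddressZip AS JSON_VALUE(Data, '$.Address.Zip') PERSISTED
);
CREATE NONCLUSTERED INDEX IX_JsonWithIndex_AddressZip
    ON dbo.JsonWithIndex (AddressZip);

-- JsonWithDimensionTable: the primary table keys itself off the JSON Id,
-- and a second, narrow table holds one row per (PersonId, Country) pair
CREATE TABLE dbo.JsonWithDimension (
    Data NVARCHAR(MAX) NOT NULL,
    PersonId AS CAST(JSON_VALUE(Data, '$.Id') AS UNIQUEIDENTIFIER) PERSISTED NOT NULL
        CONSTRAINT PK_JsonWithDimension PRIMARY KEY CLUSTERED
);
CREATE TABLE dbo.PhoneCountryDimension (
    PersonId UNIQUEIDENTIFIER NOT NULL,   -- Id of the root Person object
    Country  NVARCHAR(5)      NOT NULL    -- the 1 to many child property
);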
Testing Scenarios
Please note that these are comparative tests. Performance will be discussed in terms of percentages of maximum time, not raw numbers. Absolute duration is not a useful metric here and, if anything, would confuse the findings. See more in the takeaways.
Bulk Insert
Bulk Insert testing covers all storage patterns and measures how quickly we can insert 1,000,000 Person objects. The storage patterns and percentages are:
RawJson - 14.28%
RawJson500 - 11.73%
JsonWithIndex - 67.16%
JsonWithDimensionTable - 100.00%
Relational - 34.69%
These results are unsurprising. RawJson and RawJson500 are the fastest because they use the least processing-intensive means of inserting the data. Indexes take time to build and these tables have none. RawJson500’s advantage is almost certainly due to avoiding SQL Server having to check whether each value can be stored in the page[3], but it’s a minute advantage.
Relational comes in third place, beating out JsonWithIndex. Why would storing over 3 million rows in tables with a combined 5 indices be faster than 1 million rows in a table with a single index? Two factors seem to be at play. First, where storing the JsonWithIndex data sent ~765MB to the server, storing the Relational data sent ~226MB. Second, the Relational indices are not computed, creating a savings in overall processing. Sending 3/10ths the volume of data and minimizing additional computations appears to provide a significant performance improvement.
JsonWithDimensionTable is predictably last. Each insert operation carries the full data volume, runs an insert trigger, performs a CROSS APPLY on OPENJSON on a child property of the root object, and stores the needed data in the dimension table. It’s a pattern which maximizes the insert cost to minimize the read cost.
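For illustration, that trigger might look roughly like this. This is a sketch using the hypothetical names from the earlier DDL, not the project’s actual trigger:

CREATE TRIGGER dbo.TR_JsonWithDimension_Insert
ON dbo.JsonWithDimension
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- shred each inserted Person's PhoneNumbers array into (PersonId, Country) rows
    INSERT INTO dbo.PhoneCountryDimension (PersonId, Country)
    SELECT i.PersonId, pn.Country
    FROM inserted AS i
    CROSS APPLY OPENJSON(i.Data, '$.PhoneNumbers')
        WITH (Country NVARCHAR(5) '$.Country') AS pn;
END;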
Select (1 to 1)
This scenario covers the RawJson, RawJson500, JsonWithIndex, and Relational storage patterns. It selects Person objects, filtered by the value of a child object’s property. The select patterns and percentages are:
RawJson - 100.00%
RawJson500 - 76.69%
JsonWithIndexNoForce - 41.75%
JsonWithIndexForce - 57.72%
Relational - 56.62%
An important initial observation is that the disparity in performance is less for this scenario than with Bulk Insert. Where the longest Bulk Insert time was roughly an order of magnitude greater than the shortest, here we see optimizations yielding a maximum 2.4x improvement.
RawJson is the worst alternative, followed by RawJson500. I’m surprised at the difference in performance between the two. Both queries filter the data by a JSON_VALUE computed at query time, and I wouldn’t expect a performance gain from using a narrower column when the actual data are the same size.
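For reference, the two query shapes look roughly like this. These are illustrative sketches using the names from the DDL above, not the project’s exact queries:

DECLARE @Zip NVARCHAR(5) = N'96862';

-- RawJson / RawJson500: compute the filter value for every row at query time
SELECT Data
FROM dbo.RawJson
WHERE JSON_VALUE(Data, '$.Address.Zip') = @Zip;

-- JsonWithIndex: filter on the persisted computed column instead
SELECT Data
FROM dbo.JsonWithIndex
WHERE AddressZip = @Zip;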
JsonWithIndexNoForce, JsonWithIndexForce, and Relational have nearly equivalent performance but comparing additional metrics is revealing:
JsonWithIndexNoForce - 41.75% Total
    Deserialization - 75.43%
    Execution - 23.29%
JsonWithIndexForce - 57.72% Total
    Deserialization - 46.95%
    Execution - 50.55%
Relational - 56.62% Total
    Deserialization - 100.00%
    Execution - 17.27%
The Relational query consists of selecting 3 datasets: Person joined to Address, Address, and PhoneNumber joined to Address[4]. Although it has the lowest Execution Time, this is completely eclipsed by the Deserialization time. I may have chosen a suboptimal means of converting the DataTables back into POCOs but even a 3x improvement in deserialization speed would only put Relational on par with JsonWithIndexNoForce.
JsonWithIndexNoForce and JsonWithIndexForce are clearly viable. The difference between the two is that one explicitly tells the SQL Engine to use the table’s index. Without forcing the index, SQL Server uses a table scan to filter and select. Forcing the index causes SQL to perform an Index Seek for the filtered set and use an RID Lookup to select the JSON values.
In this instance, it’s faster to scan an entire table for the targeted rows than use an index. Further, a table scan filtering by a computed/persisted column is easily on par with traditional Relational querying.
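The forced variant presumably adds a table hint along these lines (the index name comes from the sketch above, so it’s an assumption):

DECLARE @Zip NVARCHAR(5) = N'96862';

-- force the nonclustered index; SQL Server then seeks the index
-- and uses an RID Lookup back into the heap to fetch the JSON
SELECT Data
FROM dbo.JsonWithIndex WITH (INDEX (IX_JsonWithIndex_AddressZip))
WHERE AddressZip = @Zip;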
Lastly, none of these queries executed in timeframes acceptable for transactional querying (min 2.41s). I suspected that was due to the large result sets (min 28.33MB, max 95.95MB). I created a secondary set of queries for each scenario where the records were counted instead of returned.
RawJson - 100.00%
RawJson500 - 75.33%
JsonWithIndexNoForce - 0.23%
JsonWithIndexForce - 0.20%
Relational - 3.19%
Filtering on values computed with JSON_VALUE at query time appears to be non-viable for a transactional system. Querying against indexed data is viable for both JSON and Relational storage.
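The count-only variant simply swaps the select list for a COUNT, separating execution cost from the cost of returning wide JSON rows, e.g. (illustrative):

DECLARE @Zip NVARCHAR(5) = N'96862';

SELECT COUNT(*)
FROM dbo.JsonWithIndex
WHERE AddressZip = @Zip;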
Further, I did some freehand checking of single-record select performance[5]. Pulling a single record using the JsonWithIndexNoForce, JsonWithIndexForce, and Relational queries targeting a unique record was highly performant. Interestingly, JsonWithIndexNoForce still avoided the index and used a Table Scan, yet was as good as or better than Relational.
Select (1 to Many)
This scenario covers the RawJson, RawJson500, JsonWithDimensionTable, and Relational storage patterns. It selects Person objects, filtered by the value of a child object’s property, where there may be 1 to 3 child objects per person. The select patterns and percentages are:
RawJson - 55.59%
RawJsonNoCrossApply - 100.00%
RawJson500 - 56.92%
JsonWithDimensionTable - 5.93%
Relational - 5.45%
The disparity in performance is also worth acknowledging here. We find the greatest disparity (nearly 20x) between the best and worst performers.
RawJson and RawJson500 both use CROSS APPLY OPENJSON to open the Person’s PhoneNumbers collection and JSON_VALUE to filter based on the PhoneNumber’s Country property. All of these are expensive operations.
RawJsonNoCrossApply is a naive triple filter using JSON_VALUE. Rather than treating the array of child objects as a table, we compare each of the three potential PhoneNumbers[n].Country paths to our filter value. The performance is abysmal.
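Roughly, the two shapes look like this. These are illustrative sketches rather than the project’s exact queries; among other things, the real queries may need to deduplicate Person rows when more than one phone number matches:

DECLARE @Country NVARCHAR(5) = N'298';

-- RawJson / RawJson500: shred the PhoneNumbers array, then filter the shredded rows
SELECT p.Data
FROM dbo.RawJson AS p
CROSS APPLY OPENJSON(p.Data, '$.PhoneNumbers')
    WITH (Country NVARCHAR(5) '$.Country') AS pn
WHERE pn.Country = @Country;

-- RawJsonNoCrossApply: probe each possible array position by index
SELECT Data
FROM dbo.RawJson
WHERE JSON_VALUE(Data, '$.PhoneNumbers[0].Country') = @Country
   OR JSON_VALUE(Data, '$.PhoneNumbers[1].Country') = @Country
   OR JSON_VALUE(Data, '$.PhoneNumbers[2].Country') = @Country;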
Let’s put this in perspective. Our 1 to 1 testing returned (min 28.33MB, max 95.95MB) of data. This test returned (min 0.29MB, max 1.18MB). Though returning a dramatically lower volume of data, these three queries were the slowest by an enormous margin. Consider them grave warnings on what not to do.
JsonWithDimensionTable and Relational were both fine, with roughly equivalent performance. Digging in further:
JsonWithDimensionTable - 5.93%
    Deserialization - 14.66%
    Execution - 5.91%
Relational - 5.45%
    Deserialization - 100.00%
    Execution - 5.27%
Deserialization time may seem like an important differentiator but the overall range was ~31ms. It simply does not matter. Both of these select patterns were effectively interchangeable for performance. In both queries, we’re running much the same operations.
JsonWithDimensionTable selects from the primary table (the JSON store) joined to the dimension table. Although a Table Scan is run on the dimension table, the table stores data as an index would. A Clustered Index Seek on the primary table then allows us to select the data efficiently.
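A sketch of that query, again using the hypothetical names from the DDL above:

DECLARE @Country NVARCHAR(5) = N'298';

-- filter the narrow dimension table, then seek into the JSON store by its clustered key
SELECT j.Data
FROM dbo.PhoneCountryDimension AS d
JOIN dbo.JsonWithDimension AS j
    ON j.PersonId = d.PersonId
WHERE d.Country = @Country;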
Relational performs much the same process, using Index Seeks, Clustered Index Seeks, Clustered Index Scans, and Key Lookups. In both cases, we’re effectively joining indices to tables for performant querying.
This equality of performance may change with scale. JsonWithDimensionTable performs a single join, where Relational performs three (arguably four). That’s a win for JsonWithDimensionTable. Relational is reading/transmitting much less data. That’s a win for Relational.
Takeaways
Bottom Line
None of these scenarios are perfectly optimized and there are still countless tweaks to try. I think, however, we have enough information to answer my main question. Yes, JSON is viable in SQL Server. Sufficiently viable that you can use SQL Server as a transactional document store.
Your Mileage May Vary
I tried to avoid talking about time in absolute terms because those numbers are not durable. I ran these tests on two machines: an M2 MacBook with a puny 8GB of RAM, and my (admittedly old) Intel i7 workhorse with 32GB of RAM. Different machine, different results.
Even on the same machine, timings would vary between runs. General trends were reliable but, between background processes and the fact that the seed data (and therefore the result sets) were randomized, the absolute timings were not meaningful.
Your objects may be larger, smaller, more, fewer, more complex… You can’t build high performance systems based on absolute rules. Always prototype and respond to the current need.
SQL is SQL is SQL is not Relational Data Storage
If you don’t want to store data in a relational pattern, your first bet should always be a Document Database. This was a test ground for a more ambitious project, an edge case involving combining relational storage, document storage, and event sourcing in a single database.
The way to use SQL Server well is to understand the engine, regardless of the storage pattern you apply. You don’t have to store related data in separate tables but you do have to provide the infrastructure required for queries to run quickly. You cannot get away from indexes.
Further, the maxim holds true. You can take a hit on read or write but you must take a hit. Even storing data in the traditional relational pattern took a hit compared to the naive JSON patterns. Depending on what you’re storing, how it’s stored, and how it’s retrieved, the Relational pattern can be suboptimal.
I’m still curious about another pattern. (Okay, a lot of other patterns.) Given how well a table scan performed on JsonWithIndex, if we removed the index and, instead of computing the filter column, inserted the JSON alongside the filter value, would that beat Relational on insert and select?
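As a sketch, that pattern would look something like this. The table and column names are hypothetical, and the filter value is supplied by the application rather than computed by SQL Server:

CREATE TABLE dbo.JsonWithInlineFilter (
    Data       NVARCHAR(MAX) NOT NULL,
    AddressZip NVARCHAR(5)   NOT NULL   -- written by the caller, not a computed column
);

-- @PersonJson and @Zip would come from the application at insert time
INSERT INTO dbo.JsonWithInlineFilter (Data, AddressZip)
VALUES (@PersonJson, @Zip);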
The more you understand the SQL engine, the more things you can do better. This includes atypical storage and access patterns.
Watch Your Size
Here’s a sample JSON object:
{
    "Id": "3cf1df8e-03ad-4d8d-80b5-ba76bbe39938",
    "FirstName": "John",
    "LastName": "Doe",
    "Age": 45,
    "Address": {
        "Id": "6531357d-c2e9-473b-98a9-535e26b88bac",
        "Line1": "123 Wallaby Way",
        "Line2": "",
        "City": "Sydney",
        "State": "NSW",
        "Zip": "96862"
    },
    "PhoneNumbers": [
        {
            "Id": "e1511c81-e8d9-49d4-990a-37ee47f2005e",
            "Country": "491",
            "AreaCode": "776",
            "Number": "788-2529"
        },
        {
            "Id": "e97b3436-5f05-4ca1-b602-290e2f97d122",
            "Country": "298",
            "AreaCode": "776",
            "Number": "788-2529"
        }
    ]
}
Roughly half the characters are JSON formatting. Additionally, we’re storing everything as NVARCHAR(MAX), eliminating the chance to save bytes with compact data types.
Here’s the total space used for each storage pattern:
RawJson - 810.82MB
RawJson500 - 810.94MB
JsonWithIndex - 813.07MB
JsonWithDimensionTable - 1,312.77MB
Relational - 390.91MB
Storage is cheap these days, but 2-3x the space is a lot.
Also, volume of data affects query speed. This is true of Document Databases too.
Just Use NVARCHAR(MAX) for JSON
Notice how RawJson500 uses more space than RawJson? Any idea why?
There’s a reason the Microsoft Docs examples use NVARCHAR(MAX). Does limiting the column width provide increased efficiency? Possibly. If there is an advantage, is it worth the risk of database errors when an object outgrows the limit? No.
There are real optimizations to be made. Go for something with a bigger impact and trust that your objects get stored correctly.
Thanks for Reading
If you’re still here, wow. Thanks! It’s been a while since I had a proper geek out and this has been fun. If you liked this (or any of my writing) please consider subscribing, sharing, liking, commenting, throwing a brick through my window with a supportive note. It’s the only way I can tell if I should keep at this, or if I’m just screaming into the void.
1. There are numerous additional optimizations I could have tried. What was meant to be a quick test grew in complexity beyond that scope. Please feel free to submit a PR if you’d like to add a scenario, but I probably won’t pursue it much further without restructuring the project to be more reusable and configuration-based.
2. I also mirrored the SELECT queries as COUNT queries to see if there was a major impact for data on the wire. Given that I’d need to test that across connections of different quality and latency, I won’t include that in the analysis for what’s meant to be a simple exercise.
3. When SQL Server has an NVARCHAR(MAX) column, the value can be stored in or out of page, depending on whether the total row size exceeds the maximum. All records were small enough to fit in the page but SQL Server still has to check each value.
4. I experimented with using joins for the relational data vs querying the Address table for the join keys, storing them in a temp table, and using that to filter the other tables. Performance was roughly equivalent across multiple tests.
5. This was done in Azure Data Studio and not in the main project to avoid refactoring. I arbitrarily set a single zip code in the relevant sets to “00000”, making use of existing indices.