Sunday 27 January 2013

MVC projects encourage poorly organised domain namespaces

The ASP.NET MVC framework has been a huge breath of fresh air, having usurped Webforms as the standard framework for web applications. Everyone knows the MVC pattern wasn't created by Microsoft, and the current implementation of ASP.NET MVC has some fundamental differences from the original pattern found in Smalltalk. But it has still provided a more lightweight, transparent, extensible and testable platform to build on than Webforms ever was. And it feels so much better suited to the Web. So, regardless of how much it was influenced by Ruby on Rails, I'm grateful that Microsoft were able to bring MVC under the ASP.NET umbrella.

With that said, I have a small gripe with MVC. It's a very small gripe and one that can really only be loosely attributed to it, but it's the most obvious place to point the finger for the sake of this article: I think the ASP.NET MVC framework encourages developers to organise domain namespaces poorly.

First, let me explain what I think a good domain namespace organisation should look like - largely learned from Domain-Driven Design. Domain namespaces (essentially, folders in the domain project) should expose the purpose of the system. The names should closely and honestly reflect the primary concepts of the domain model. The language of the domain model should be reflected in code - this includes the namespaces the code resides in. If we are designing an application based on a domain model described by the Ubiquitous Language we have unearthed in communication between developers and business experts, then our design must share this language. A well organised namespace should allow the business experts working on the project to look at the root of the domain and recognise the key concepts in the names of the root level folders.

In any domain, you will always need some classes as part of your design that do not come directly from the model. These classes commonly come as part of a design pattern being implemented. Their presence makes the communication of the pattern clearer, usually by encapsulating away the details, such as in the Façade Pattern. Such classes usually come with a suffix to describe their purpose, such as: Provider, Factory, Repository, Builder or Manager. It is when domain namespaces become oriented around these classes that are external to the model that organisation has failed. A domain namespace should not describe the purpose of the classes it contains. We should never end up with folders in our domain project called Providers, Factories, Repositories, Builders, or Managers because this is never what our domain is about. Yet this is exactly what MVC encourages.

Because MVC is a Framework (and not a library) it gives us handles to work with. ASP.NET MVC is useful largely because it gives us answers to questions like "where do I put a new Controller?" or "where do my Scripts go?". With MVC, you know where to add new Models because there is a folder called Models. It doesn't matter that those (view) models may relate to completely distinct concepts, they can still exist in the same folder - and they're only view models anyway, right? They don't tie directly to the domain model, they're just there to make the presentation layer work.

Right. However, the problem is that I see developers make the mistake of following this convention in the domain project. The model is ignored and folders are created to categorise a sub collection of classes that all happen to be "Factories" or other such classes external to the model. The domain project then fails to be decipherable to a business expert and no longer describes itself with the Ubiquitous Language.

Presumably developers do this because Microsoft gave us the template MVC project as is and because it's just easier to put that new Factory in the Factory folder with the others. It'll be easy to find later. But therein lies the danger: if we are not forced to think hard about how to arrange the namespace then we risk designing something that is not true to the model. If we have to think before we add a class, it makes the choice to add a new class tougher. We will only then be able to authorise the creation of classes that have an appropriate place in the namespace of our domain - keeping the design as close to the model as possible. If we unearth a purpose for a class that doesn't appear in the model (and also isn't an artefact of a design pattern) then we must revisit our model, communicate the idea with our business experts and incorporate the idea in the Ubiquitous Language.

In defence of MVC, it couldn't really work any other way. MVC is a presentation framework - it's not supposed to be a domain framework (if there is such a thing?). It has licence to organise itself in the way that suits this purpose. It is the developer's responsibility to organise the domain so that it suits its purpose (according to the model). Just because MVC has a layout that groups together its classes by their purpose does not mean any other project should. The real issue is to make sure that developers are in tune with the domain and to ensure that it is exposed in the design, and not to use the MVC framework as a guiding tool for how to do that.

Even inside the domain, perhaps a layer or two in from the root, there is a place for grouping together classes that are external to the model. If the development team is comfortable that, nested inside folders named according to the model, it is appropriate to name a folder 'Factories', then so be it. But I would expect such folders to be a rarity. (Just how many Factories would really be needed in one corner of the domain?)

But maybe there is something to learn from MVC in how we organise the domain? With MVC 2 came the concept of Areas. The goal of an Area is to provide logical URL routing and help organise the presentation layer classes and assets according to the routing rules. If you have a number of Areas in your MVC project, then the Areas folder starts to look like a well organised domain. All the root folders will speak in terms of the domain model, while sub folders will house Models, Views & Controllers and be named as such. So, if there is anything MVC can do to encourage better domain namespace organisation, it is to include Areas in all template MVC projects.

Sunday 20 January 2013

Query Optimising by Reducing Database Round Trips

We'd been suffering from a poorly constructed query that resulted in multiple trips to our database. I can say it was poorly constructed because I wrote it. I knew exactly how inefficient it was. It worked, which was enough to release early with. But the knowledge of a slack query in one of our core systems was enough to make me take a few days to tighten it up. This is how I went about improving that query.

The Problem
There are two entities involved in our query: a Pixel and a UrlTagContainer. Pixels hold a small amount of html that needs to be returned along with the response of the requested Url for a webpage. Think of the system as a micro CMS for invisible content. Pixels contain the tiny (pixel-sized) bit of content in one property and a string collection of Tags in another property. UrlTagContainers contain a Url property and also have a string collection of Tags. Pixels are linked to a Url by these Tags. The query needs to retrieve the correct Pixels for a given Url, where one or more of the Url's Tags intersects with the Tags of the Pixel. In addition, if a Pixel has no Tags then it should always be returned by the query. The benefit of this relationship is that it makes it possible to group Pixels together. The application of Pixels to a Url can be managed through a Tag, so as to break the linear relationship between Pixels and Urls.

As stated, the query was inefficient because it made two calls to the database per single attempt to get the results. The first step was to retrieve the Tags applied to the Url. The second step was to find all the Pixels with Tags that intersected the results of the first step (or had no Tags at all). The graph below shows typical throughput over 24 hours:
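To make the two steps concrete, here is a rough sketch in Python rather than RavenDB code. It is not the original query: the dictionaries are in-memory stand-ins for the two collections (using the sample data from the walkthrough further down), and the function name is made up for illustration. Each lookup represents one round trip to the database.

```python
# Hypothetical in-memory stand-ins for the two document collections.
PIXELS = {1: {"abc"}, 2: {"def"}, 3: {"abc", "ghi"}, 4: set(), 5: {"def"}}
URL_TAGS = {
    "jim.com": {"ghi"},
    "bob.com": {"abc", "def"},
    "tom.com": {"def"},
    "ann.com": {"def"},
    "pam.com": {"ghi"},
}


def pixels_for_url(url):
    """Return the Ids of the Pixels that apply to the given Url."""
    # Round trip 1: fetch the Tags applied to the Url.
    url_tags = URL_TAGS.get(url, set())
    # Round trip 2: fetch Pixels whose Tags intersect those Tags,
    # plus any Pixel that has no Tags at all.
    return sorted(
        pixel_id
        for pixel_id, tags in PIXELS.items()
        if not tags or tags & url_tags
    )


print(pixels_for_url("jim.com"))
```

Every request pays for both lookups, which is the cost the rest of this post sets out to remove.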


The Database
Our pixel system uses RavenDB. Instead of managing our own instance we are using RavenHQ. We license a database adequate for the volume of data we have in production - but not for the throughput we were generating. It was RavenHQ who first alerted us to the high traffic and suggested we use Aggressive Caching. This we did, but it always felt like papering over the cracks, because we knew where the real issue was.

The Solution
The goal was to reduce the number of calls to the database per request. I had to query the database in such a way that the results would contain the Pixel as well as the Url(s) that the Pixel had intersecting Tags with. This would allow me to filter on this projected 'MatchedUrls' property.

Sounds straightforward. However, the challenge lay in thinking in terms of RavenDB Indexes - as opposed to the old SQL statement. Getting the results in the shape I needed required the index to perform several steps. Roughly, the index flattens out the data, then pulls it back together and returns the result. To help me understand what was happening at each step I wrote out the exploded data. I have posted the individual parts of the index with an explanation of what is happening and a view of the results at each step.

First off, the raw data. We begin with a collection of Pixels and a Collection of Url Tags. (I recognise that RavenDB doesn't store the data in a tabular fashion, but this is simply the easiest way to think about this...)

Pixel Id   Tags
1          abc
2          def
3          abc, ghi
4
5          def

Url        Tags
jim.com    ghi
bob.com    abc, def
tom.com    def
ann.com    def
pam.com    ghi

Every RavenDB index requires a Map function. This Map function works with a particular data type (of which there is a collection in the database) and produces some output from that data. The output of the Map can be the end result of the query, or it can feed into subsequent steps of the Index. We will be making further use of the output of our Map.

In fact, because we are working with two object collections here (Pixels and UrlTagContainers), the index needs to have multiple maps. To accomplish this our index needs to be an instance of AbstractMultiMapIndexCreationTask. The objective of the Map steps in our index is to project each Tag from inside the Pixel or UrlTagContainer entity into its own object. In other words, we need to find a way to aggregate two different types into the same object type - one for each Tag across the two collections of objects.

The object that we would like needs to contain our three key elements: the Tag, the Pixel Id and the UrlTagContainer Url. When mapping Pixels, we know that only two of these properties will be populated: Tag and Pixel Id. We can leave the Url blank at this stage. For the UrlTagContainer we can take the Tag and the Url. The Pixel Id will be left empty. After the two map steps have taken place, the data is pulled together into one set of results of the single projected data type. The Maps of the index are shown below, followed by how the data will look after the Maps have been applied:

Tag        Pixel Ids   Urls
abc        1
def        2
abc        3
ghi        3
_notag_    4
def        5
ghi                    jim.com
abc                    bob.com
def                    bob.com
def                    tom.com
def                    ann.com
ghi                    pam.com
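As a rough illustration of what the two Maps produce, here is a Python simulation of the step. It is not RavenDB code and not the actual index: the function names and the dictionary shape are assumptions made for this sketch, and only the _notag_ placeholder comes from the data above.

```python
NO_TAG = "_notag_"  # placeholder Tag for Pixels that have no Tags


def map_pixel(pixel_id, tags):
    # One entry per Tag on the Pixel; the Urls are left empty for now.
    return [
        {"tag": tag, "pixel_ids": [pixel_id], "urls": []}
        for tag in (tags or [NO_TAG])
    ]


def map_url(url, tags):
    # One entry per Tag on the Url; the Pixel Ids are left empty.
    return [{"tag": tag, "pixel_ids": [], "urls": [url]} for tag in tags]


pixels = [(1, ["abc"]), (2, ["def"]), (3, ["abc", "ghi"]), (4, []), (5, ["def"])]
url_tags = [
    ("jim.com", ["ghi"]),
    ("bob.com", ["abc", "def"]),
    ("tom.com", ["def"]),
    ("ann.com", ["def"]),
    ("pam.com", ["ghi"]),
]

# Both maps feed one combined list of the same projected shape.
mapped = [entry for pid, tags in pixels for entry in map_pixel(pid, tags)]
mapped += [entry for url, tags in url_tags for entry in map_url(url, tags)]

for entry in mapped:
    print(entry)
```

Holding the Pixel Ids and Urls as lists, even when there is only one value, means every entry already has the shape the grouping step will output.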

This collection of data is perfect for performing a grouping function on. That is exactly what we do in the Reduce step of the index. The group key is obviously the Tag. We group on the Tag and then 'SelectMany' on both the Pixel Id collection and Url collection properties of the projected object to flatten them into a single list of Pixel Ids and Urls for each Tag. The result of the Reduce is therefore a collection of all the Tags, where each Tag exists only once in the list and is accompanied by the Pixel Ids and Urls to which the Tag is applied. This is best described by the table below:

Tag        Pixel Ids   Urls
abc        1,3         bob.com
def        2,5         bob.com, tom.com, ann.com
ghi        3           jim.com, pam.com
_notag_    4
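The grouping can be sketched in Python as follows. Again this is a simulation rather than the RavenDB Reduce itself; the tuple layout and the function name are assumptions for illustration, and extending the lists plays the role of the 'SelectMany' flattening.

```python
from collections import defaultdict

# Output of the Map step as (tag, pixel_ids, urls) tuples.
mapped = [
    ("abc", [1], []), ("def", [2], []), ("abc", [3], []), ("ghi", [3], []),
    ("_notag_", [4], []), ("def", [5], []),
    ("ghi", [], ["jim.com"]), ("abc", [], ["bob.com"]), ("def", [], ["bob.com"]),
    ("def", [], ["tom.com"]), ("def", [], ["ann.com"]), ("ghi", [], ["pam.com"]),
]


def reduce_by_tag(entries):
    """Group on the Tag and flatten the Pixel Ids and Urls of each group."""
    groups = defaultdict(lambda: {"pixel_ids": [], "urls": []})
    for tag, pixel_ids, urls in entries:
        groups[tag]["pixel_ids"].extend(pixel_ids)
        groups[tag]["urls"].extend(urls)
    return dict(groups)


reduced = reduce_by_tag(mapped)
for tag, group in reduced.items():
    print(tag, group["pixel_ids"], group["urls"])
```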

A Map and Reduce function is usually the end of a RavenDB index. However, the result in its current format is not enough to query on yet. We have to make use of the TransformResults Function of RavenDB Indexes. This is useful for two reasons. Firstly, because it gives us a final opportunity to distort the data into something we can use. Secondly, it provides an argument to the function that lets us re-query the database. As you can see, we are working with the Ids of the Pixels, not the Pixels themselves. We need to return the full Pixel in our results so we can pull out the required piece of html content to return with the page. In addition, we may need to query on other properties of the Pixel.

The aim of the TransformResults function is to give us the final output of the query from the output of the Reduce function. In simple terms, we are selecting every Pixel Id from each element in the results of the Reduce function into one flat list of pixel Ids. For each Pixel Id we then select the corresponding Urls from the same Reduce function result into a new projected object. We populate the projected object with the actual pixel, loaded from the database argument using the Pixel Id. Alongside the Pixel is the collection of Urls it applies to. This meets the demands of our query.

Pixel       Urls
{ Id: 1 }   bob.com
{ Id: 2 }   bob.com, tom.com, ann.com
{ Id: 3 }   bob.com, jim.com, pam.com
{ Id: 4 }
{ Id: 5 }   bob.com, tom.com, ann.com
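Here is the same transformation as a Python simulation. It is a sketch, not the TransformResults function from the index: PIXEL_STORE is a hypothetical stand-in for loading a Pixel through the database argument, and the function name is made up for illustration.

```python
# Output of the Reduce step, keyed by Tag.
reduced = {
    "abc": {"pixel_ids": [1, 3], "urls": ["bob.com"]},
    "def": {"pixel_ids": [2, 5], "urls": ["bob.com", "tom.com", "ann.com"]},
    "ghi": {"pixel_ids": [3], "urls": ["jim.com", "pam.com"]},
    "_notag_": {"pixel_ids": [4], "urls": []},
}

# Hypothetical stand-in for loading the full Pixel document by its Id.
PIXEL_STORE = {pixel_id: {"Id": pixel_id} for pixel_id in range(1, 6)}


def transform(groups):
    """Pair each Pixel with every Url that shares at least one Tag with it."""
    matched = {}
    for group in groups.values():
        for pixel_id in group["pixel_ids"]:
            urls = matched.setdefault(pixel_id, [])
            # A Pixel can reach the same Url through several Tags,
            # so skip Urls that are already listed.
            urls.extend(u for u in group["urls"] if u not in urls)
    return [
        {"pixel": PIXEL_STORE[pixel_id], "matched_urls": urls}
        for pixel_id, urls in sorted(matched.items())
    ]


results = transform(reduced)
for row in results:
    print(row["pixel"], row["matched_urls"])
```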


The Result
Now, with a query that doesn't require hitting the database twice, we have seen a sharp decrease in the traffic we send. We are still making use of caching, so the dramatic change in throughput in the graph below is not down purely to the query, but it certainly was a factor. We now have the room required to lower the cache duration, and still perform better than we were.




The Cost
Additionally, I brought the Tag querying into code so that all that we ask for from the index is the full set of results (which is very small). This means that there are fewer unique queries to hit the database with, so it is easier to cache the result. However, it does mean that the CPU has to work harder to query our cached data. On the other hand, the application's memory footprint has reduced, presumably as a result of not having to hold the results of so many distinct queries in memory:
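Filtering the cached result set in code might look something like the Python sketch below. This is an assumption about the general approach, not the production code: the names are hypothetical, and the check on the Pixel's own Tags is my own addition so that a Pixel is only treated as "always returned" when it genuinely has no Tags, rather than merely having matched no Urls.

```python
# Hypothetical cached copy of the full index output.
CACHED_RESULTS = [
    {"pixel": {"Id": 1, "Tags": ["abc"]}, "matched_urls": ["bob.com"]},
    {"pixel": {"Id": 2, "Tags": ["def"]},
     "matched_urls": ["bob.com", "tom.com", "ann.com"]},
    {"pixel": {"Id": 3, "Tags": ["abc", "ghi"]},
     "matched_urls": ["bob.com", "jim.com", "pam.com"]},
    {"pixel": {"Id": 4, "Tags": []}, "matched_urls": []},
    {"pixel": {"Id": 5, "Tags": ["def"]},
     "matched_urls": ["bob.com", "tom.com", "ann.com"]},
]


def pixels_for_url(url, results=CACHED_RESULTS):
    """Filter the cached results in memory instead of querying per Url."""
    return [
        row["pixel"]["Id"]
        for row in results
        # Untagged Pixels always apply; tagged ones only via MatchedUrls.
        if not row["pixel"]["Tags"] or url in row["matched_urls"]
    ]


print(pixels_for_url("tom.com"))
```

The scan over the cached rows is where the extra CPU work comes from, while the cache only has to hold one result set instead of one per Url.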



The Index
Here is the actual index in its final form: