A bit later than promised, but a good start to 2014: part 2 on Hippo CMS architecture. This time I will zoom in on the Hippo Delivery Tier.

Prerequisite:

A Bird's Eye Hippo CMS Architectural View Part 1: 10,000 foot view

Disclaimer: This blog is meant to give an architectural background on the Hippo Delivery Tier (HST). I cannot explain everything in detail in a single blog, so I have tried to stick to the parts most relevant for an initial understanding of the global HST architecture. It might not be trivial to digest, certainly not if you don't have a strong technical background. Before reading on, it helps to first watch the movie A quick overview of Hippo CMS in just under 3 minutes. Note that its first minute is mainly about the enterprise relevance add-on and is not so important for understanding this blog. Another good read about the global Hippo architecture is Understanding Hippo CMS 7 Software Architecture by Woonsan Ko. In some parts it might give a better global overview, in other parts this blog might be more specific, and some parts will overlap. Best if you read both :). If you also follow the trails at Getting started with Hippo, that will further help you understand this blog.

A brief history of names:

What’s in a name? that which we call a rose
By any other name would smell as sweet;

Obviously, the above is complete nonsense in the field of programming. Names are an important aspect of how developers communicate in speech and in writing. If a class is called BookProxy, BookDecorator, BookFacade, SingletonImmutableBookFactory, BookWrapper, etc., the developer is already telling you what kind of behavior you can expect from the class. There is a lot in a name. Yet we make mistakes with names, and typically, when we do so, we’re stuck with them because they have become part of the API, marketing, documentation or not so easy to change schemas.

And we made a mistake when calling the Hippo Delivery Tier HST. HST is an acronym for Hippo Site Toolkit. In general a toolkit isn’t precisely defined, but it can best be seen as a combined set of libraries. A library is a collection of code relating to a specific task, or a set of closely related tasks which operate at roughly the same level of abstraction. Developers call the code in a library, instead of their code being called, as is the case with a framework. The HST is exactly the latter: your code is called by the HST, so control is inverted compared to a library or a toolkit. Hence, when we started with the HST, a better name would have been HSF (the F for framework). However, since we do not only support site development, but support rendering the content in our repository in any format/channel (desktop webpages, mobile webpages, PDF, REST XML/JSON responses, mobile apps), the name that in my opinion would best cover the HST these days is HDF: Hippo Delivery Framework. For historical reasons, though, I will stick to the acronym HST.

Content Driven

As the title already suggests, this blog will cover HST architecture from a 1,000 foot view. I will not explain the CMS document editing part itself in which editors/authors work, or how to extend the CMS or workflow or add custom CMS plugins. However, since many HST design choices stem from our vision on content (management), I first need to explain how we at Hippo look at content. As part of our DNA since the early days, we are firm believers that authors should write content (a document) once, and that we expose that content differently across channels, depending on the request (desktop, mobile, PDF, REST XML/JSON, etc.). Also, we believe that authors should be concerned with writing a document with limited markup (like headers, bold, bulleted lists, tables, etc.), but not an entire web page including layout. Web page layouts (page definitions) are initially created by developers and in production maintained by webmasters/marketers. For example, every news document on the English part of my website below URLs that start with /news should have the same layout, and that layout should not be maintained by an author per document. As a result, the HST had to deal with documents/content being only a part of the complete response, and with documents whose relations to each other are expressed as UUIDs and not as URLs. Yet, in our delivery tier we wanted to expose SEO friendly URLs without UUIDs in them.

Designers and then developers are initially in charge of what the pages for certain documents look like. Once in production, webmasters can modify a page definition through the channel manager, but it is important to realize that these page definitions are frequently shared by many documents. Changing the webpage layout for one news document typically changes the webpage layout for all other 100,000 news documents as well... instantly. In a nutshell: Hippo CMS is content driven, not page driven. It is content that really matters. Content is your most valuable asset. Presentation follows later.

Context Aware Content Driven

With the introduction of our enterprise relevance support, we’ve added context to the content driven design: depending on the context of a visitor, the HST can serve different pages for one and the same URL. The context can be anything: geo-IP location, whether the visitor is a returning visitor or not, whether they mostly look at jobs or products, whether they arrived via Google, whether they searched for certain terms, etc. More about this in a later blog.

For now, this should be enough background on how we look at content/documents. Below I will start with a very short overview of what I think are the HST key features. After this I will pick up where I left off in the previous blog: how the request flow through the HST works. In this blog I won’t be able to zoom in on every key feature. Some features already have detailed documentation or a blog about them; those will have links. In upcoming blogs about Hippo CMS architecture I will explain in more detail some features that are not yet covered elsewhere.

HST key features in a nutshell

The HST has had a primary focus on performance from the very beginning. By default, it is stateless and scales linearly both up and out [1], [2]. Without compromising performance, it renders any changed content in the repository live immediately. It supports virtual hosting and URL matching [17], both stored in the repository and modifiable at runtime. Since Hippo CMS lets authors and editors concern themselves with writing content, and has this content exposed across different channels, the HST has built-in link rewriting for relations between documents, resulting in clean SEO-friendly URLs [30]. The link rewriting seamlessly accounts for links between different domains (hosts), and is context aware: given the context of the visitor, different URLs for one and the same relation between two documents can be created at runtime. The link rewriting also seamlessly integrates with cross HTTP/HTTPS links [3].

Another very powerful HST feature is the default page rendering: a page definition is stored in the repository as a hierarchical tree of components, each optionally with a Java class (a controller preparing the model) [21] and a view (JSP / Freemarker template) to render the output. Since the page definition is stored in the repository, we can change the pages for, say, all news documents in a running production environment (with preview / live workflow), just by changing the page definition through one of the news pages in the channel manager [18]. Note the subtle difference between a news document, which is what an author works on in the CMS, and a news page definition, which determines what all news document pages look like in a specific channel and can be managed by a webmaster in the channel manager.

The HST contains a simple, lightweight, lazy content bean mapping to map JCR nodes (repository stored content) to domain specific beans/POJOs (e.g. NewsDocument.java) created by developers. To spare developers from having to learn the various optional JCR query languages, including their do’s and don’ts, the HST ships with a Java search API [26] that only creates JCR queries that are known to perform and scale [28], [29]. Since authors and editors who are working on documents (content, not webpages) in the CMS typically want to preview how their document looks in all the channels in which it is exposed, the HST by default supports a preview and a live version, internally handled by using a JCR session from the preview session pool or the live session pool. As part of this preview/live and search API support, the HST gets instant authorized search results with correct authorized counts for preview or live through the repository [4]. They are instant because Hippo Repository is able to map Hippo’s security model to queries for Lucene, its backing search library. Through the repository, the HST also supports a very straightforward API for faceted navigation structures [19], including instant correct authorized counts through [4]. Recently, an extra Hippo Repository security feature has been added, Security Delegation [16], which enables us to combine the access rules of two separate JCR sessions. This is particularly useful if editors/authors need to be able to preview their document in a certain channel, but the normal HST preview session user is not allowed to read the document, for example because it contains highly sensitive information. Therefore, when an editor previews a document in a channel in the CMS, we combine her read access with the read access of the preview site user. Since we do this at the repository level, we even keep all the aforementioned security and efficient authorized querying for combined sessions.

The HST has a pluggable container which is assembled through Spring Framework configuration. An important pluggable part is the set of processing pipelines, which can be added or modified by injecting custom valves (by developers, and not at runtime!). Through these processing pipelines, the HST supports RESTful API integrations [5] to expose content as a web service over JSON or XML: it supports creating RESTful web services through JAX-RS [27] integration (JAX-RS being the Java API for creating web services according to the Representational State Transfer architectural style). The HST container can easily be integrated with other web application frameworks [6], including full-fledged MVC frameworks such as Spring Web MVC, Struts2, Cocoon, Wicket, Tiles 2, Sitemesh, etc. It also includes a simple Filter/Servlet based handler module (for even simple 'Hello, World!' style Servlets or JSPs).

HST Security contains authentication/authorization support for websites, including JAAS and form based authentication support. Spring Security Framework can be configured with this too, in order to support various security requirements such as SiteMinder integration, Enterprise SSO integrations, etc.

For optimal performance, the HST supports page caching, configurable per host / mount / URL (prefix). To support page caching when not all parts of a page are cacheable, parts of a page can be rendered asynchronously through AJAX or through ESI [7], [8], [9]. Even though the HST itself is blisteringly fast and scalable, this doesn’t mean developers cannot implement non-scaling or just plain slow solutions. Quite frequently, non-scaling solutions only manifest themselves when the repository (content size) grows bigger. Think about what happens when a developer searches for all news documents in the entire repository, then iterates through all the documents in the search result and does a new search per hit to find, for example, the 5 news documents with the most comments. Obviously, this won’t scale [20]. In case you find yourself in a situation where responses are too slow, the HST has built-in diagnostics support that can be switched on/off in production to help you dissect how long the different parts of the response took to generate [10].

As part of the Enterprise edition, the HST supports relevance: personalized pages depending on pluggable collectors (e.g. location, referrer, returning visitor, search terms, visited content containing keywords, etc.) that aggregate context/knowledge about the visitor. The relevance module is integrated through an enterprise valve in the default processing pipeline (see below) and works seamlessly with page caching. Because the HST scales so extremely well horizontally and vertically, we were able to embed/integrate personalization inside the application, instead of layering it as a poor man’s approach on top of an existing application. This makes it much more powerful and customizable in its usage.

Last but not least, a feature in itself is that the HST uses standards based, widely adopted technologies, which enables developers to learn it quickly and deliver fast. With the HST, back-end developers in general program HstComponent classes in Java, write the rendering in JSP or Freemarker, and can use a standard HST tag library [22] for the most common rendering features. If they want to change the core, they inject or override Spring configuration of the container. If the application to be built uses the HST’s RESTful API integrations, front-end and back-end developers can work even more independently.

HST key features list

  1. Performance and complete vertical and horizontal scaling
  2. Rendering content changes without delay
  3. Runtime modifiable virtual hosting and URL matching
  4. Built-in context aware link rewriting, seamlessly integrated for linking between different hosts
  5. Seamless cross HTTP / HTTPS linking support
  6. Composite component based page rendering modifiable at runtime
  7. Content Bean/POJO Mapping
  8. Search API with authorized search results directly from Lucene
  9. Preview and live support
  10. JCR Session pooling
  11. Faceted navigation support including instant correct authorized counts
  12. Security delegation
  13. Pluggable container through Spring configurations
  14. RESTful API integration
  15. HST Container Integration with Other Web Application Frameworks
  16. HST Security
  17. Page caching
  18. Async ajax rendering and/or ESI
  19. Page diagnostics
  20. Relevance and personalized pages based on context
  21. HST uses widespread common technologies

Understanding the HST request handling

I ended the previous blog, A Bird's Eye Hippo CMS Architectural View Part 1, with the 1,000 foot schematic request handling graph through the HST shown below. In the rest of this blog I will zoom in on this request handling.

 

Above, on the left, there is the Repository application containing the stored HST configuration, and on the right there is the HST application with a very global schematic request flow. Note that both the repository and the HST are deployed within a single container, optionally as a single webapp or as two separate webapps. The request matching and link rewriting parts of the request handling use the matching configuration (runtime modifiable) in the repository. The request processing uses the repository stored HST configuration for page definitions (runtime modifiable) and the repository stored content (documents & binaries) maintained by editors/authors. Since link rewriting is just the inverse of the matching phase, and actually happens during request processing instead of after it, the easiest way to understand the HST request handling is to split it into two main phases:

  1. Request Matching
  2. Request Processing

Request Matching:

Elaborate documentation on the matching phase can be found at [17]. In short, the HST tries to match an incoming request first to a host (domain), then to a mount (subsite/sub-channel), and then to a sitemap item (using the remainder of the URL after the mount).
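To make the three matching steps a bit more concrete, here is a minimal sketch in plain Java. It is not the HST API: the class RequestMatchingSketch, the in-memory map and the host, mount and channel names are all made up for illustration, whereas the real configuration lives in the repository. The sketch only resolves host and mount, and returns the remainder of the path that would then be matched against the sitemap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of host and mount matching; not the HST API.
class RequestMatchingSketch {

    // host name -> (mount path -> channel name); invented example data
    static final Map<String, Map<String, String>> HOSTS = new LinkedHashMap<>();

    static {
        Map<String, String> mounts = new LinkedHashMap<>();
        mounts.put("/fr", "french-site");  // sub-channel mounted below /fr
        mounts.put("", "english-site");    // root mount, checked last
        HOSTS.put("www.example.com", mounts);
    }

    // Returns "channel:remainder", or null when no host or mount matches.
    static String match(String host, String path) {
        Map<String, String> mounts = HOSTS.get(host);                 // step 1: host
        if (mounts == null) {
            return null;
        }
        for (Map.Entry<String, String> mount : mounts.entrySet()) {   // step 2: mount
            String prefix = mount.getKey();
            if (prefix.isEmpty() || path.equals(prefix) || path.startsWith(prefix + "/")) {
                // step 3 input: what is left of the URL goes to sitemap matching
                String remainder = path.substring(prefix.length());
                if (remainder.startsWith("/")) {
                    remainder = remainder.substring(1);
                }
                return mount.getValue() + ":" + remainder;
            }
        }
        return null;
    }
}
```

With this example data, a request for /fr/news/a.html on www.example.com resolves to the french-site channel with news/a.html left over for the sitemap, while the same path without /fr falls through to the root mount.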

An important aspect to realize about request matching is that all matching configuration (host, mount and sitemap) is stored in the repository, and can thus be changed at runtime in production. This setup also enables us to add completely new channels from blueprints [23] to a running production environment without even requiring an httpd restart. Remember from A Bird's Eye Hippo CMS Architectural View Part 1 the httpd configuration we use:

 

Any request matching the domain *.example.com gets forwarded to http://127.0.0.1:8080/site (obviously, 127.0.0.1 only applies when httpd runs on the same server, which in production setups is generally not the case). The HST application still has access to the original host (domain) through ProxyPreserveHost On, hence the HST can match the original request to a host configured in the repository. This means that with the above httpd configuration, any new channel, for example my.example.com, can be added at runtime. If the ServerAlias were just *, then any request on this httpd instance on port 80 that is not for cms.example.com would be forwarded to the HST.

Request Processing:

In general, after the request matching has been done, some sitemap item has been matched (there are also cases where there is only a matched mount, for example for some JAX-RS services [11], but that is left out of scope for now). From the matched sitemap item, first of all, the pipeline to process is retrieved. When no pipeline is explicitly configured, the HST container defaults to the DefaultSitePipeline. Pipelines are based on the inversion of control pattern and are assembled through Spring Framework configuration. The pipelines that ship with the HST can be modified/overridden, and completely new ones can be added. The default pipeline configurations in the HST core can be found in SpringComponentManager-pipelines.xml. In SpringComponentManager-cmsrest.xml you can see an example of how a non-core CmsRestPipeline (used for CMS channel manager communication over REST with the HST) is injected into the existing pipelines.

The request processing component (HstRequestProcessor) of the HST container invokes three series of valves: initializationValves, processingValves and cleanupValves. The initializationValves are responsible for initialization, the processingValves do the core request/response rendering, and the cleanupValves are responsible for cleaning up temporary request processing data and optionally outputting diagnostics information about the request that was handled. As of this writing (CMS 7.9.0), the DefaultSitePipeline looks as follows:

[Figure: DefaultSitePipeline]
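The idea of three valve series can be sketched in a few lines of plain Java. This is only an illustration under assumptions: PipelineSketch and its Valve interface are hypothetical and much simpler than the real HST types, the valve names in demo() are just labels, and running the cleanup valves in a finally block is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified pipeline; the real HST Valve and Pipeline types differ.
class PipelineSketch {

    interface Valve {
        void invoke(List<String> log) throws Exception;
    }

    private final List<Valve> initializationValves = new ArrayList<>();
    private final List<Valve> processingValves = new ArrayList<>();
    private final List<Valve> cleanupValves = new ArrayList<>();

    PipelineSketch init(Valve valve) { initializationValves.add(valve); return this; }
    PipelineSketch process(Valve valve) { processingValves.add(valve); return this; }
    PipelineSketch cleanup(Valve valve) { cleanupValves.add(valve); return this; }

    // Runs initialization, then processing; in this sketch cleanup runs even after a failure.
    List<String> run() {
        List<String> log = new ArrayList<>();
        try {
            for (Valve valve : initializationValves) valve.invoke(log);
            for (Valve valve : processingValves) valve.invoke(log);
        } catch (Exception e) {
            log.add("error: " + e.getMessage());
        } finally {
            for (Valve valve : cleanupValves) {
                try {
                    valve.invoke(log);
                } catch (Exception ignored) {
                    // one failing cleanup valve must not stop the others
                }
            }
        }
        return log;
    }

    // Builds a tiny pipeline whose valves only record that they were invoked.
    static List<String> demo() {
        return new PipelineSketch()
                .init(log -> log.add("initialization"))
                .process(log -> log.add("pageCaching"))
                .process(log -> log.add("aggregation"))
                .cleanup(log -> log.add("cleanup"))
                .run();
    }
}
```

Because the pipeline calls the valves and not the other way around, adding behavior comes down to putting one more valve into one of the three lists, which is the role the Spring configuration plays in the HST.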

Because the valves inside a pipeline are in turn also Spring managed, it is possible to override existing valves or, through the HST orderable valve support [24], to inject new ones at specific locations. This is, for example, how we can plug enterprise features into the open source HST. An example of this is our enterprise relevance, which injects itself as the last initialization valve as follows:

[Figure: updatetargetingvalve]

In general, as a developer working on an end project you do not often need to inject your own valves into the existing HST core pipelines, but it is a powerful mechanism that enables us to inject enterprise features or custom behavior easily. Also it enabled us to build the entire channel manager communication over REST with the HST through separate secured pipelines.

As already mentioned, the DefaultSitePipeline is used when no pipeline is explicitly configured. The DefaultSitePipeline assumes processing with HST components. Normally, a page definition consists of a composite tree of HST components. For a normal render (GET) request, the final processing valve is the AggregationValve, which first invokes the #doBeforeRender methods of all HstComponent classes for the current request, and then for every component invokes its renderer (a JSP or Freemarker script, where the latter can be stored and maintained in the repository itself). Below is a schematic example of a breakdown of a page layout into an HST page definition. I used /news/**.html to indicate that any URL matching this path info gets this page definition. It is easy to see that the page layout can be represented as a composite tree of components: this is the composite tree I mean when I talk about page definitions. It is exactly the components in this tree that get modified when a webmaster modifies pages or drags and drops areas of a page around in the channel manager. This is of course possible because the component tree shown below is stored in the repository and reloaded into the HST model on any change.

[Figure: page layout decomposition]
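The two passes over the composite tree (first all controllers, then all renderers) can be illustrated with a small stand-alone sketch. The Component class below is hypothetical and has nothing to do with the real HstComponent interface; the traversal order and the tag-like output are assumptions made only to keep the example readable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical composite component; it mimics the two-pass idea, not the HstComponent API.
class AggregationSketch {

    static class Component {
        final String name;
        final List<Component> children = new ArrayList<>();

        Component(String name, Component... kids) {
            this.name = name;
            for (Component kid : kids) {
                children.add(kid);
            }
        }

        // Pass 1: every controller prepares its model (here: parent before children).
        void doBeforeRender(List<String> trace) {
            trace.add("prepare " + name);
            for (Component child : children) {
                child.doBeforeRender(trace);
            }
        }

        // Pass 2: a parent's view embeds the rendered output of its children.
        String render() {
            StringBuilder out = new StringBuilder("<" + name + ">");
            for (Component child : children) {
                out.append(child.render());
            }
            return out.append("</").append(name).append(">").toString();
        }
    }

    // Aggregates a whole page: all controllers first, then all renderers.
    static String aggregate(Component root, List<String> trace) {
        root.doBeforeRender(trace);
        return root.render();
    }
}
```

A page made of a header and a main area containing a news detail component is then just new Component("page", new Component("header"), new Component("main", new Component("newsdetail"))). Moving a component in such a tree changes every page that shares the definition.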

For action requests, the #doAction method of a single HstComponent’s Java class is invoked, directly after which a browser side HTTP redirect to a render (GET) request is done: the HST uses the Post/Redirect/Get pattern [25]. For the schematic request processing, for now, I’d like to refer to Request Handling with Components, which can be found at [12].

Request Matching -> Request Processing and the HstRequestContext

Above, I’ve explained request matching and request processing in a nutshell and referred to their documentation, but I have not yet explicitly described the part where the HST container hands over from request matching to request processing. Which page definition should be used for the matched sitemap item and, optionally, which content should be retrieved from the repository? To answer these questions, a developer or webmaster configures extra information on a sitemap item.

By far the two most important sitemap item properties to set are hst:componentconfigurationid and hst:relativecontentpath. Optionally, the pipeline to use can be configured when you want a pipeline other than the DefaultSitePipeline to be used.

Page Definition to use for a sitemap item -> hst:componentconfigurationid

This is configured through the property hst:componentconfigurationid. Never mind the name; it is there for historical reasons. The value is relative to the HST configuration for the current channel, and commonly points to a component below hst:pages. For example, the sitemap item at hst:sitemap/news/**.html can have hst:componentconfigurationid = hst:pages/newspage, implying that any URL that matches news/**.html will be rendered as a newspage.

Content to use for a sitemap item -> hst:relativecontentpath

When a document from the repository needs to be rendered for some URL, the relative content path points to the location (relative to the content of the matched channel) where the document (or folder) can be found. For example the sitemap item for the homepage could contain hst:relativecontentpath = common/homedocument, implying the document in the CMS to be shown for the homepage is located at common/homedocument. If the channel for the request has as root content /content/documents/myproject, then the home page content is located absolutely at /content/documents/myproject/common/homedocument.

Obviously, you should not create a new sitemap item for every new document that an author adds. Instead, we use the wildcard patterns that the sitemap supports [13]. For example, the sitemap item at hst:sitemap/news/**.html can contain the relative content path hst:relativecontentpath = news/${1}, where ${1} gets the value of the part of the URL that matched **.
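How a wildcard ends up in the relative content path can be sketched as follows. Note that this is only an illustration: WildcardSketch is a hypothetical helper that translates the sitemap pattern into a regular expression, which is an assumption made for brevity and not how the HST matcher is implemented. The paths used in the usage note are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it only illustrates wildcard capture and ${n} substitution.
class WildcardSketch {

    // Turns a sitemap pattern into a regex: ** spans path segments, * stays within one segment.
    static Pattern toRegex(String sitemapPattern) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < sitemapPattern.length()) {
            if (sitemapPattern.startsWith("**", i)) {
                regex.append("(.+)");
                i += 2;
            } else if (sitemapPattern.charAt(i) == '*') {
                regex.append("([^/]+)");
                i += 1;
            } else {
                regex.append(Pattern.quote(String.valueOf(sitemapPattern.charAt(i))));
                i += 1;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the relative content path with ${1}, ${2}, ... filled in, or null when there is no match.
    static String resolve(String sitemapPattern, String relativeContentPath, String pathInfo) {
        Matcher matcher = toRegex(sitemapPattern).matcher(pathInfo);
        if (!matcher.matches()) {
            return null;
        }
        String result = relativeContentPath;
        for (int group = 1; group <= matcher.groupCount(); group++) {
            result = result.replace("${" + group + "}", matcher.group(group));
        }
        return result;
    }

    // Prefixes the channel's content root to get the absolute repository path.
    static String absolute(String contentRoot, String relativePath) {
        return contentRoot + "/" + relativePath;
    }
}
```

For the path info news/2012/06/my-first-article.html, resolve("news/**.html", "news/${1}", ...) yields news/2012/06/my-first-article, and with a content root of /content/documents/myproject the absolute location becomes /content/documents/myproject/news/2012/06/my-first-article.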

During the matching phase, an HstRequestContext object is created. It is the most important HST framework object and has the lifecycle of an HttpServletRequest. It is stored as a thread-local variable and can be accessed during the entire HST request handling through the static getter

RequestContextProvider#get() 

After the matching phase, the HstRequestContext contains matching information flyweights that can be accessed through getters, for example HstRequestContext#getResolvedMount() or HstRequestContext#getResolvedSiteMapItem(). A ResolvedSiteMapItem is a flyweight wrapping an HstSiteMapItem configuration object, where the latter is the same object instance for every concurrent visitor. The flyweight adds only some request specific information, for example the value of ${1} for hst:relativecontentpath = news/${1}. This is a fundamental design pattern in many parts of the HST, and it is the reason that with the 7.9-alpha I can render 3,500 pages per second on my laptop without page caching enabled: there is the shared, global, almost immutable HST model (more about this in a later blog, but for now see [2]) that every request has a reference to, accessed through a couple of very tiny flyweights that add just the request specific information.
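A stripped-down illustration of this flyweight idea is given below. The classes SiteMapItemConfig and ResolvedItem are hypothetical stand-ins (they are not HstSiteMapItem and ResolvedSiteMapItem), and the sketch assumes a single wildcard to keep it short.

```java
// Hypothetical classes illustrating the flyweight idea; not the HST's own types.
class FlyweightSketch {

    // Shared, immutable configuration: one instance serves every concurrent request.
    static final class SiteMapItemConfig {
        final String relativeContentPath;

        SiteMapItemConfig(String relativeContentPath) {
            this.relativeContentPath = relativeContentPath;
        }
    }

    // Tiny per-request wrapper: a reference to the shared config plus the matched wildcard value.
    static final class ResolvedItem {
        final SiteMapItemConfig shared;
        final String wildcardValue;

        ResolvedItem(SiteMapItemConfig shared, String wildcardValue) {
            this.shared = shared;
            this.wildcardValue = wildcardValue;
        }

        // Combines shared configuration with request specific state on demand.
        String getRelativeContentPath() {
            return shared.relativeContentPath.replace("${1}", wildcardValue);
        }
    }
}
```

Each request allocates only one small ResolvedItem, while the potentially large configuration model is never copied. That is why the per-request cost stays low regardless of how big the configuration is.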

After the matching phase has been done, the HstRequestProcessor executes the pipeline belonging to the matched sitemap item. Most generally, this is the DefaultSitePipeline, in which first the initializationValves, then the processingValves and then the cleanupValves are executed. After the InitializationValve has been processed, two very important getters on the HstRequestContext also become available:

HstRequestContext#getContentBean() 

Returns the bean/POJO that wraps the backing JCR stored Node (document). For example, when the URL is /news/2012/06/my-first-article.html and the matched sitemap item is news/**.html with hst:relativecontentpath = news/${1}, the HST tries to fetch the content from news/2012/06/my-first-article. If it is not present, the getter returns null.

HstRequestContext#getSiteContentBaseBean()

Returns the bean/POJO (HippoFolderBean) for the root JCR content node of the channel (site) belonging to the current request.

A normal render request hits the aggregationValve as the last processing valve (assuming the pageCachingValve has not already returned a cached result). Through the hst:componentconfigurationid of the matched sitemap item, the aggregationValve finds which page definition to use, renders all components and aggregates them into a single response. After the aggregationValve finishes, the cleanup valves are invoked, cleaning up thread locals and returning JCR sessions to the pools. After this, the response gets sent back to the visitor.

If you have read until here, and you did not already have experience with the HST, I can imagine not everything is yet crystal clear :-). The next blog I will write in this series of blogs should help you further understand the HST and its internals, as in that blog I plan to trace a single (standard get) request through the HST container. Stay tuned for CMS Architectural View Part 3: tracing a single request through the HST.

More about link rewriting in detail in a next blog but for now very briefly:

Link rewriting in a nutshell

I already explained that at Hippo we think authors should focus on content, not on presentation. Obviously, they want to relate documents to each other, through ‘see also’ kinds of blocks, but of course also through internal links. Obviously, we do not store URLs: URLs depend on which channel exposes the content, and even on the context of the visitor. Instead, UUIDs are stored as the relations. During rendering, the HST has to translate these UUIDs into real URLs. Likewise, when querying the repository through the HST Search API [26], the HST retrieves JCR Nodes, which it should also be able to map to URLs. Going from a UUID to a JCR node is simple:

Session#getNodeByIdentifier(String id)

From a JCR Node, Node#getPath() gives us the hierarchical location (path) of the node. From this path we remove the prefix that the channel points to as its content root. For example, if the node path is /content/documents/myproject/news/2014/06/my-second-article and the channel myproject has /content/documents/myproject as its content root, then the path relative to the channel is news/2014/06/my-second-article. Obviously, this document can be accessed through the relative content path news/${1} of the sitemap item hst:sitemap/news/**.html. Thus, the HST can create a URL for the JCR Node (document). This is the most basic example; it does not explain context aware linking, cross channel linking, HTTP/HTTPS linking, etc. You can read more about these at [14], [15].

If you wonder whether this link rewriting might be a bit CPU expensive when you have many channels with sitemaps containing many sitemap items, then the short answer is no. We do not even cache the link rewrite results, as they are too cheap to generate. This is because when we create the in-memory sitemap item matching model (a model of URL paths to relative content locations), we also create the inverse model (a map of relative content locations to sitemap items). This is just a simple tree of hashmaps, through which it is very cheap to find possible matching sitemap items for any node in the repository.
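The inverse lookup can be sketched as a small tree of hashmaps keyed by content path segments. Everything in this sketch is an assumption made for illustration: LinkRewriteSketch, its register and createUrl methods and the URL templates are hypothetical, only one wildcard is supported, and mount paths, hosts and context awareness are ignored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical inverse model: relative content location -> URL template of a sitemap item.
class LinkRewriteSketch {

    // One hashmap level per content path segment.
    static class Node {
        final Map<String, Node> children = new HashMap<>();
        String urlTemplate; // non-null when a sitemap item points at this location
    }

    private final Node root = new Node();
    private final String contentRoot;

    LinkRewriteSketch(String contentRoot) {
        this.contentRoot = contentRoot;
    }

    // Registers the inverse of a sitemap item, e.g. content prefix "news" -> "news/${1}.html".
    void register(String contentPrefix, String urlTemplate) {
        Node node = root;
        for (String segment : contentPrefix.split("/")) {
            node = node.children.computeIfAbsent(segment, key -> new Node());
        }
        node.urlTemplate = urlTemplate;
    }

    // Walks the tree along the node path; the deepest registered template wins.
    String createUrl(String absoluteNodePath) {
        if (!absoluteNodePath.startsWith(contentRoot + "/")) {
            return null; // the node is not part of this channel's content
        }
        String[] segments = absoluteNodePath.substring(contentRoot.length() + 1).split("/");
        Node node = root;
        String template = null;
        int consumed = 0;
        for (int i = 0; i < segments.length; i++) {
            node = node.children.get(segments[i]);
            if (node == null) {
                break;
            }
            if (node.urlTemplate != null) {
                template = node.urlTemplate;
                consumed = i + 1;
            }
        }
        if (template == null) {
            return null;
        }
        String rest = String.join("/", Arrays.copyOfRange(segments, consumed, segments.length));
        return "/" + template.replace("${1}", rest);
    }
}
```

After register("news", "news/${1}.html") on a channel with content root /content/documents/myproject, the node /content/documents/myproject/news/2014/06/my-second-article maps to /news/2014/06/my-second-article.html. Each lookup costs a few hashmap hits per path segment, which illustrates why caching the results is hardly worth it.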

Summary

In this blog, I've given a high level overview of the HST key features and its design. Then I took a more technical deep dive into how the HST's request handling works. We've seen that it consists of two main phases, Request Matching and Request Processing, and how they are connected. In later blogs, I will explain more about the HST architecture and concepts.

References:

[1] http://www.onehippo.com/en/resources/blogs/2013/03/reproducible-and-falsifiable-performance-tests-with-hippo-cms.html
[2] http://www.onehippo.com/en/resources/blogs/2013/01/delivery-tier-stale-model-support.html
[3] http://www.onehippo.org/7_8/library/concepts/request-handling/hst-seamless-https-support.html
[4] http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html
[5] http://www.onehippo.org/7_8/library/concepts/rest/restful-jax-rs-component-support-in-hst-2.html
[6] http://www.onehippo.org/7_8/library/concepts/integration/hst-container-integration-with-other-web-application-frameworks.html
[7] http://www.onehippo.org/7_8/library/concepts/component-development/asynchronous-hst-components-and-containers.html
[8] http://www.onehippo.com/en/resources/blogs/2013/04/hippo-cms-and-edge-side-includes.html
[9] http://www.onehippo.org/7_8/library/concepts/web-application/hst-2-edge-side-includes-support.html
[10] http://www.onehippo.org/7_8/library/concepts/request-handling/hst-page-diagnostics.html
[11] http://www.onehippo.org/7_8/library/concepts/rest/restful-api-support---plain-jax-rs-services.html
[12] http://www.onehippo.org/7_8/library/concepts/request-handling/hst-2-request-processing.html
[13] http://www.onehippo.org/7_8/library/concepts/request-handling/sitemapitem-matching.html
[14] http://www.onehippo.org/7_8/library/concepts/links-and-urls/hst-2-urls.html
[15] http://www.onehippo.org/7_8/library/concepts/links-and-urls/context-aware-canonical-preferred-and-navigationstateful-urls.html
[16] http://www.onehippo.org/7_8/library/concepts/security/repository-session-security-delegation.html
[17] http://www.onehippo.org/7_8/library/concepts/request-handling/hst-2-request-matching.html
[18] http://www.onehippo.org/7_8/library/concepts/template-composer/hst-2-template-composer.html
[19] http://www.onehippo.org/7_8/library/concepts/faceted-navigation/faceted-navigation-configuration.html
[20] http://en.wikipedia.org/wiki/Schlemiel_the_Painter's_algorithm
[21] http://www.onehippo.org/7_8/library/concepts/component-development/hst-2-component-development.html
[22] http://www.onehippo.org/7_8/library/concepts/web-application/hst2-tag-library.html
[23] http://www.onehippo.org/7_8/library/concepts/channels/blueprints.html
[24] http://www.onehippo.org/7_8/library/concepts/hst-spring/hst-orderable-valve-support.html
[25] http://en.wikipedia.org/wiki/Post/Redirect/Get
[26] http://www.onehippo.org/7_8/library/concepts/search/hst-2-search.html
[27] http://jax-rs-spec.java.net/ 
[28] http://www.onehippo.org/7_8/library/concepts/search/fast-date-range-searches-with-hippo-repository.html
[29] http://www.onehippo.org/7_8/library/concepts/search/out-of-the-box-fast-date-range-query-support-from-hst.html
[30] http://en.wikipedia.org/wiki/Clean_URL