Find Problems With Checklists, Solve Them With Services

You may find the following useful if efforts to confirm exactly where your team and others have got up to are distracting you from useful work.

We got interested in this problem because we were doing analysis on two different projects for two different clients at two different stages and found the same problem on both: a scrum board and a set of design pages were insufficient. Neither the current position nor the projected progress was clear enough. We also thought we could do better at evolving the designs as the requirements and constraints evolved, and at keeping the user stories fully in sync with the designs, requirements and constraints.

The solution concepts were:

  1. Be organised

    1. Earlier visibility of a more complete list of stories gives more options for solving problems

    2. Things can change, but if the stories are at the right level of detail then we get the right balance between identifying key problems earlier and wasting administration effort - we should be as specific as the stability of information can justify

    3. Remember that being organised doesn’t actually solve the problems, though it helps identify them and can avoid other problems occurring due to missed scope - look out for strongly diminishing returns on administrative effort once the problems are identified and look for an architectural rather than an administrative solution

    4. Use Point of View Checklists to facilitate stakeholders identifying completeness of stories and design across the whole solution - for example have a Security PoV Checklist that lists the key security NFRs and gives links to the specific design pages and the specific user stories - this will reduce gaps versus expectations, and provides a convenient way to review projected progress with that stakeholder, as they can see which user stories are still open. The effort of maintaining the checklist is paid for by how much faster the reviews become.

    5. Use Design Checklists to facilitate identifying that the design pages are in line with the user stories, and that the designs are current, rather than letting your Confluence turn into a Necropedia (thanks D.I. for this term)

    6. The above points improve planning - it is still necessary to deliver using a solid definition of done within sprints, including getting stakeholder inspections

  2. Be in production

    1. There are diminishing returns on effort for relating planning material and status reports to actual status, but if the service is actually running then it can be inspected, and if it is running in production then it can be inspected without any need to explain quirks of the environment

    2. There can be a formal “preview” production version as well as a “stable” production version, giving the above benefits of inspecting in production without compromising a production system that other services may rely on

    3. The same stakeholders with whom we use the Point of View Checklists can do exploratory testing or inspections of preview or stable production, and can use the production service specifications to be completely clear about what is currently delivered

    4. A more formal preview environment lets the team do more exploratory testing to validate the design status against what is actually delivered, and gives additional options for chaos engineering style experiments

    5. Avoid “go live” disconnects by basing our understanding of what we have on what is available in production rather than on what the project plan says is done

    6. If the simple service you have at the start is difficult to put into production, the problem will get worse as the service becomes more complex. This doesn’t mean actually going to General Availability in the first sprint, but the gap to being able to do so should be small.

    7. We often don’t know when we’ll be completely done, but following a sensible approach it’ll be the fastest it could be. We DO however know what’s already in production.

  3. Describe services not projects

    1. This all assumes we’re delivering a service not a project - without a service you can’t tell what you have, and there is nothing to inspect, and there is a risk of neglecting operations

    2. Services are unmanageable if they’re big, so they will need layering

    3. You can’t Agile a project, you can only Agile delivery of services

    4. If you’re working on a project, one of the most useful architectural techniques is to describe how to reach the desired outcome through a layered set of services, some of which you will create, some of which already exist, and (avoid if possible) some of which need to be created by other teams.

  4. Relate to other production services

    1. Developing services on top of other development services is risky versus on top of production, because those services are not yet stable and we can’t know for sure when they will be stable. It is possible to establish conventions for integration testing between development services but there are costs associated.

    2. Describing a project as a set of layered services and working out what you can use that’s already in production greatly reduces risk. A system can evolve no faster than its slowest integration point so if you can use services that are already in production with the required functionality, then the risk of a slow down is greatly reduced.

  5. Solve difficult problems by using suitably scalable services

    1. When layering services, a weakness in any one service is a weakness of the whole solution. This is the same problem as in projects, but you can see it better. Don’t take an approach that has these weaknesses!

    2. Understand and use other existing production services that scale adequately to solve the above weaknesses - this is a considerable investment of time and can get neglected if the team is too narrowly focused on delivering user stories. If you’re the architect you need to choose to do this because success/failure feedback on this is closer to the project timescale, not the sprint timescale. You don’t get feedback on what you don’t test, and if you don’t invest in finding out what you should test, you won’t even know what you’re missing.

    3. Be conscious of the base rate for success at the task you’re doing - if reference class forecasting for the approach shows failure then do NOT try to get order of magnitude improvements by working harder at either development or at management. You need to use different services.
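As a minimal sketch of points 3 and 4 above, a project can be written down as a layered set of services, each marked as already in production, being created by our team, or to be created by another team. The service names, the `Status` values and the `risky_dependencies` helper are all hypothetical, made up for illustration only; the point is just that once the layering is explicit, the dependencies on services that are not yet in production can be listed mechanically.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PRODUCTION = "already in production"
    OURS_IN_DEVELOPMENT = "being created by our team"
    OTHER_TEAM_IN_DEVELOPMENT = "to be created by another team"


@dataclass
class Service:
    name: str
    status: Status
    depends_on: list[str] = field(default_factory=list)


def risky_dependencies(services: list[Service]) -> list[tuple[str, str]]:
    """Return (service, dependency) pairs where the dependency is not yet in production."""
    by_name = {s.name: s for s in services}
    return [
        (s.name, dep)
        for s in services
        for dep in s.depends_on
        if by_name[dep].status is not Status.PRODUCTION
    ]


# Hypothetical layering for a reporting project (names are illustrative only).
services = [
    Service("object-storage", Status.PRODUCTION),
    Service("managed-database", Status.PRODUCTION, ["object-storage"]),
    Service("source-extract-api", Status.OTHER_TEAM_IN_DEVELOPMENT),
    Service("warehouse-load", Status.OURS_IN_DEVELOPMENT,
            ["managed-database", "source-extract-api"]),
    Service("reporting", Status.OURS_IN_DEVELOPMENT, ["warehouse-load"]),
]

for service, dependency in risky_dependencies(services):
    print(f"{service} depends on {dependency}, which is not in production yet")
```

In this made-up example the two flagged pairs are the places where progress depends on something that cannot yet be inspected in production, which is where we would look first for an architectural alternative.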

The solution in practice …

Relating design pages, design checklists, user stories and environments with the following diagram can help explain what the point of view checklists are reviewing:


We’d suggest there’s value in reviewing point of view checklists periodically - the following list of areas is representative of what we’ve used or intend to use:

  • The actual users

  • Senior management

  • Operations management

  • Availability/Disaster Recovery management

  • Data governance

  • Infosec

  • Legal (including licences)

  • Architecture governance

  • Another architecture governance team (there could be several)

  • Various CoE teams

  • Vendor relationship management (prevent them complaining about how you used the product)

  • QA governance

  • End user documentation management (user documentation, and also end user training)

  • Resource cost and availability management

  • Delivery timescales management

  • Infrastructure costs (specialist and general)
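To make the idea of a Point of View Checklist more concrete, here is a rough sketch of one as a data structure: each item names a requirement and links it to design pages and user stories, and a review reports the gaps and the stories still open. The class names, the `review` function and the example security items are assumptions for illustration, not a description of any particular tool; in practice this would more likely be a wiki page or a tracker query.

```python
from dataclasses import dataclass


@dataclass
class Story:
    key: str
    done: bool


@dataclass
class ChecklistItem:
    requirement: str          # e.g. a key security NFR
    design_pages: list[str]   # titles or links of the design pages covering it
    stories: list[Story]      # user stories that deliver it


def review(items: list[ChecklistItem]) -> dict[str, list[str]]:
    """Summarise a checklist for a stakeholder review."""
    summary = {"no design page": [], "no stories": [], "open stories": []}
    for item in items:
        if not item.design_pages:
            summary["no design page"].append(item.requirement)
        if not item.stories:
            summary["no stories"].append(item.requirement)
        summary["open stories"] += [s.key for s in item.stories if not s.done]
    return summary


# Hypothetical Security PoV Checklist (requirements and story keys are made up).
security_checklist = [
    ChecklistItem("Data encrypted at rest", ["Storage design"], [Story("SEC-1", True)]),
    ChecklistItem("Access is audited", ["Audit design"], [Story("SEC-2", False)]),
    ChecklistItem("Secrets are rotated", [], []),
]

for heading, entries in review(security_checklist).items():
    print(f"{heading}: {', '.join(entries) or 'none'}")
```

A requirement with no design page or no stories is a gap versus the stakeholder's expectations, and the open stories are what is left to review as projected progress.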

What turned out to be more interesting was that in decomposing a project into services it becomes a lot more apparent where the problems are, and a particularly interesting one is the use of the network for any on premise (rather than cloud) components of the solution. With data warehousing or big data projects that have several TB of data that needs to be moved quickly, we can use the concept of a service dependency to tell the networks team how much throughput we expect to need to use. Until that amount of throughput is reliably available in production, we are a stakeholder reviewing a point of view checklist for the team making changes.
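The throughput figure to give the networks team is simple arithmetic: data volume divided by the time window it has to move in. The sketch below assumes decimal terabytes and a hypothetical 5 TB load in a 4 hour window; both numbers and the function name are illustrative, and real links will need headroom above the sustained figure.

```python
def required_throughput_gbit_s(data_tb: float, window_hours: float) -> float:
    """Sustained throughput in Gbit/s needed to move data_tb terabytes in window_hours."""
    bits = data_tb * 1e12 * 8          # decimal terabytes to bits
    seconds = window_hours * 3600
    return bits / seconds / 1e9


# Hypothetical figures: 5 TB to move within a 4 hour load window.
needed = required_throughput_gbit_s(data_tb=5, window_hours=4)
print(f"Sustained throughput needed: {needed:.2f} Gbit/s")
```

With these assumed figures the sustained requirement comes out at about 2.78 Gbit/s, which is the kind of number that can be stated as a service dependency and checked against what is reliably available in production.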

Another example where there are more difficulties is access to source systems where full self service access is not being provided. Where there isn’t a reliable production self service access mechanism and we need to raise tickets to get another team to do some work, we have greater uncertainty and progress is not nearly as reliable or as fast.

An example of where architectural changes can make a significant difference is where we swap out one component for another that requires much less manual effort to operate. Managed service databases make a big difference for data warehousing environments in this way - the days that typically get lost to maintaining backups with on premise MPP systems are eliminated because that entire underlying service of doing the most basic database management is handled by the supplier rather than the team.

Other Considerations and References…

Get much simpler, be really clear...