Summary

So, we’ve covered a lot here! I’ll attempt to summarize this chapter into some easy-to-follow advice.

For each service:

- Track inbound response time at a bare minimum. Once you’ve done that, follow with error rates and then start working on application-level metrics.
- Track the health of all downstream responses, at a bare minimum including the response time of downstream calls, and at best tracking error rates. Libraries like Hystrix can help here.
- Standardize on how and where metrics are collected.
- Log into a standard location, in a standard format if possible. Aggregation is a pain if every service uses a different layout!
- Monitor the underlying operating system so you can track down rogue processes and do capacity planning.

For the system:

- Aggregate host-level metrics like CPU together with application-level metrics.
- Ensure your metric storage tool allows for aggregation at a system or service level, and drill down to individual hosts.
- Ensure your metric storage tool allows you to maintain data long enough to understand trends in your system.
- Have a single, queryable tool for aggregating and storing logs.
- Strongly consider standardizing on the use of correlation IDs.
- Understand what requires a call to action, and structure alerting and dashboards accordingly.
- Investigate the possibility of unifying how you aggregate all of your various metrics by seeing if a tool like Suro or Riemann makes sense for you.

I’ve also attempted to outline the direction in which monit...
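As a rough illustration of the per-service advice above (inbound response time first, then error rates, logs in one standard format, and a correlation ID carried through), here is a minimal Python sketch. Everything in it is hypothetical: `ServiceMetrics`, `handle`, and the JSON field names are invented for this example, and the in-memory counters stand in for whatever metrics and log aggregation tools you actually use.

```python
import json
import time
import uuid
from collections import defaultdict


class ServiceMetrics:
    """In-memory stand-in for a real metrics client (hypothetical)."""

    def __init__(self):
        self.timings_ms = defaultdict(list)  # endpoint -> response times in ms
        self.requests = defaultdict(int)     # endpoint -> total calls
        self.errors = defaultdict(int)       # endpoint -> failed calls

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0


def handle(metrics, endpoint, handler, correlation_id=None, log=print):
    """Run handler, recording response time, errors, and one JSON log line."""
    # Reuse the caller's correlation ID if there is one; otherwise start one.
    correlation_id = correlation_id or str(uuid.uuid4())
    start = time.perf_counter()
    status = "ok"
    try:
        return handler(correlation_id)
    except Exception:
        status = "error"
        metrics.errors[endpoint] += 1
        raise
    finally:
        # Runs on success and failure, so every request is timed and logged.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        metrics.requests[endpoint] += 1
        metrics.timings_ms[endpoint].append(elapsed_ms)
        log(json.dumps({
            "endpoint": endpoint,
            "correlation_id": correlation_id,
            "status": status,
            "elapsed_ms": round(elapsed_ms, 3),
        }, sort_keys=True))
```

A handler would typically pass the same `correlation_id` on to any downstream calls, so that the log lines from every service touched by one request can be found with a single query in the log aggregation tool.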
Summary of integration chapter