Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless. The right way of aggregating response time data is to add the histograms.
Good luck with that.
Grass spews out a metric shitload of requests, with the number based on the user's social graph and the number of books in a work. We measure the maximum wait time for our threads per request, and take the tp99 of that over all requests on many servers.
It's mathematically meaningless as a measurement, but when it goes up we know we need to adjust our thread pools or add servers.
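For anyone wondering how far apart the two approaches can land, here's a rough sketch. Everything in it is made up for illustration: the two hosts, the latencies, the 10 ms bucket width, and the helper names `histogram` and `percentile`. It is not how Grass or any particular metrics system does it.

```python
from collections import Counter
import math

def histogram(latencies_ms, bucket_ms=10):
    """Count samples per fixed-width bucket, keyed by the bucket's upper edge."""
    counts = Counter()
    for x in latencies_ms:
        counts[math.ceil(x / bucket_ms) * bucket_ms] += 1
    return counts

def percentile(hist, p):
    """Smallest bucket upper edge that covers at least p percent of samples."""
    total = sum(hist.values())
    threshold = total * p / 100
    seen = 0
    for edge in sorted(hist):
        seen += hist[edge]
        if seen >= threshold:
            return edge
    raise ValueError("empty histogram")

# Hypothetical data: a busy, mostly fast host and a nearly idle, slow host.
host_a = [10] * 990 + [500] * 10   # 1000 requests
host_b = [10] * 5 + [2000] * 5     # 10 requests

hist_a = histogram(host_a)
hist_b = histogram(host_b)

# Averaging per-host percentiles gives the idle host the same weight
# as the busy one.
averaged_p99 = (percentile(hist_a, 99) + percentile(hist_b, 99)) / 2

# Adding the histograms (Counter + Counter sums the counts per bucket)
# weights every request equally.
merged_p99 = percentile(hist_a + hist_b, 99)

print(averaged_p99, merged_p99)
```

With this toy data the averaged figure comes out at about twice the merged one, because ten requests on the slow host count as much as a thousand on the fast one. Both numbers still go up if a host starts queueing, which is why the averaged version can work as an alarm even though it isn't a real percentile.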