For production, it is common practice to raise your log level above DEBUG (e.g. to WARN or ERROR) because of the sheer volume of traffic. This is because you have to consider several cost factors (a rough worked example follows the list):

Cost of Logging: CloudWatch Logs charges $0.50 per GB of logs ingested. In my experience, this is often much more than the Lambda invocation cost.

Storage Cost: CloudWatch Logs charges $0.03 per GB per month, and the default retention policy is to never expire! A common practice is to ship your logs to another log aggregation service and set the retention policy to X days. See this post for more details.

Cost of Processing: If you're processing logs with Lambda functions, you need to take the cost of those Lambda invocations into account as well.
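To put these numbers in perspective, here is a back-of-the-envelope calculation. The log volume is an assumption made up for illustration; only the per-GB prices come from the list above.

```typescript
// Rough monthly CloudWatch Logs cost, assuming 100 GB of logs per month (illustrative only).
const gbPerMonth = 100;
const ingestion = gbPerMonth * 0.5;  // $0.50 per GB ingested     -> $50/month
const storage = gbPerMonth * 0.03;   // $0.03 per GB-month stored -> $3 for the first month
// With the default "never expire" retention the storage charge compounds:
// after a year you are storing ~1,200 GB, i.e. ~$36/month and climbing.
console.log({ ingestion, storage });
```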

But doing so leaves you without any debug logs in production. When a problem occurs, you won't have debug logs to help you identify the root cause.

Instead, you have to waste precious time deploying a new version of your code just to enable debug logging. And you must remember to disable debug logging again when you deploy the fix.

With microservices, you often have to do this for more than one service to collect all the debug messages you need.

All of this increases the mean time to recovery (MTTR) during an incident. That's not what we want.

It shouldn’t be like this.

There is a happy middle ground between having no debug logs at all and logging everything at DEBUG level: sample debug logs from a small percentage of invocations.

With Lambda, I don't need most of the features of a full-featured logger like pino. Instead, I prefer to use a simple logger module like this one.
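The module itself isn't reproduced here, but a minimal sketch of what such a logger might look like is below. The module name, the LOG_LEVEL environment variable, and the enableDebug helper are assumptions for illustration, not the exact code from the post.

```typescript
// logger.ts — a minimal, hypothetical level-based Lambda logger.
// It filters by level and writes structured JSON to stdout, which CloudWatch Logs captures.
const LEVELS = { DEBUG: 20, INFO: 30, WARN: 40, ERROR: 50 } as const;
type Level = keyof typeof LEVELS;

// Current level comes from an environment variable (assumed name: LOG_LEVEL).
let currentLevel: Level = (process.env.LOG_LEVEL as Level) || 'INFO';

function log(level: Level, message: string, params?: Record<string, unknown>) {
  if (LEVELS[level] < LEVELS[currentLevel]) return; // drop messages below the current level
  console.log(JSON.stringify({ level, message, ...params }));
}

export const logger = {
  debug: (msg: string, params?: Record<string, unknown>) => log('DEBUG', msg, params),
  info:  (msg: string, params?: Record<string, unknown>) => log('INFO', msg, params),
  warn:  (msg: string, params?: Record<string, unknown>) => log('WARN', msg, params),
  error: (msg: string, params?: Record<string, unknown>) => log('ERROR', msg, params),
  // Switch to DEBUG at runtime and return a function that restores the previous level;
  // the sampling middleware below relies on this.
  enableDebug: () => {
    const previous = currentLevel;
    currentLevel = 'DEBUG';
    return () => { currentLevel = previous; };
  },
};
```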

Using middy, I can create a middleware that dynamically raises the log level to DEBUG for a configurable percentage of invocations. At the end of the invocation, the middleware restores the previous log level.
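Here is a minimal sketch of what such a middleware could look like, written against the request-based middleware hooks of middy v2 and later. The SAMPLE_DEBUG_LOG_RATE variable and the logger helpers are assumptions carried over from the sketch above, not the exact middleware from the post.

```typescript
// sample-logging.ts — hypothetical middy middleware that enables DEBUG logs
// for a configurable percentage of invocations.
import middy from '@middy/core';
import { logger } from './logger';

// Assumed env var: fraction of invocations to sample, e.g. "0.01" for 1%.
const sampleRate = parseFloat(process.env.SAMPLE_DEBUG_LOG_RATE || '0.01');

export const sampleLogging = (): middy.MiddlewareObj => {
  let restoreLevel: (() => void) | undefined;

  return {
    before: async () => {
      // Roll the dice once per invocation.
      if (Math.random() < sampleRate) {
        restoreLevel = logger.enableDebug();
      }
    },
    after: async () => {
      // Restore the level so the change doesn't leak into later invocations
      // (Lambda containers are reused, so module state persists).
      restoreLevel?.();
      restoreLevel = undefined;
    },
    onError: async (request) => {
      // Special handling on error: restore the level and log the error itself.
      restoreLevel?.();
      restoreLevel = undefined;
      logger.error('invocation failed', { error: request.error?.message });
    },
  };
};
```

A handler would then be wrapped as `export const handler = middy(baseHandler).use(sampleLogging())`.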

You can also see that there is some special handling for when an error occurs during the invocation.

It's great to have debug logs for a small percentage of invocations. But when you're working with microservices, you need to make sure your debug logging covers the entire call chain.

It's the only way to put together a complete picture of everything that happened on that call chain. Otherwise, you'll end up with fragments of debug logs from multiple call chains, but never a complete picture of any one of them.

You can do this by forwarding the decision to enable debug logging as a correlation ID. Each downstream service in the call chain honors the decision and forwards it along in turn. See this post for more detail.
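A minimal sketch of the idea follows. The header name x-correlation-debug-log-enabled and the helper functions are illustrative assumptions, not a specific library's API.

```typescript
// correlation.ts — hypothetical sketch of propagating the "debug logs enabled"
// decision along the call chain as a correlation ID.
// (A real implementation would reset this state at the start of every invocation.)
const correlationIds: Record<string, string> = {};

// Capture incoming correlation IDs, e.g. from HTTP headers or message attributes.
export function captureCorrelationIds(headers: Record<string, string | undefined>) {
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase().startsWith('x-correlation-') && value) {
      correlationIds[key.toLowerCase()] = value;
    }
  }
}

// Did an upstream service already decide to turn debug logging on?
export function debugLogEnabled(): boolean {
  return correlationIds['x-correlation-debug-log-enabled'] === 'true';
}

// Record our own sampling decision so it is forwarded downstream.
export function setDebugLogEnabled() {
  correlationIds['x-correlation-debug-log-enabled'] = 'true';
}

// Attach all correlation IDs to outgoing requests or messages.
export function outgoingHeaders(): Record<string, string> {
  return { ...correlationIds };
}
```

The sampling middleware would check debugLogEnabled() before rolling the dice, and any outgoing HTTP call or published message would include outgoingHeaders(), so the next service in the chain makes the same decision.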

That's all: one more pro tip on how to build observability into your serverless application. If you want to learn more about how to do everything with Serverless, check out my 10-Step Guide here.

Production debugging, as the name suggests, is debugging an application in its production environment to find the root cause of a problem. It often has to be done remotely, because in the production phase it may not be possible to reproduce the problem in the application's local environment. Production bugs are also harder to solve, because the development team may not have access to the environment where the problem emerged.

[Figure: the production debugging cycle]
Why production debugging is needed
In an ideal world, all errors and bugs would be caught in the development or QA phase. There would be no differences between the three environments, and all settings would be identical, making the entire deployment workflow more robust and predictable. However, the world is not perfect, and such uniformity is difficult to achieve in practice.

For example, consider an application that is heavily oriented around data (internal or third party). The data sets used in production are not the same as the data sets used in QA or development. Problems may therefore arise that were not caught at an earlier stage, because the production environment uses a different, untested data set.

There are two possibilities in this scenario: either the production data set is made available for testing in the other environments, or the problem is identified and solved directly in the production environment. The latter is where production debugging comes in.

Modern infrastructure challenges for production debugging
Today's infrastructure is becoming more and more distributed. While this mostly helps with maintaining large applications efficiently, it makes debugging harder because it is harder to trace a bug back to its source. A distributed application has many moving parts, and when a problem occurs it must first be isolated to find its origin.

An example of such a system is serverless computing. Not only does it use a distributed architecture, it also abstracts away the underlying application infrastructure and its capabilities. In this architecture, applications are decomposed into single-purpose functions hosted on managed infrastructure.

This makes it almost impossible for a developer to debug under normal circumstances, because the application does not run in a local environment.

In these circumstances, developers need to collect enough information to solve the problem directly from the running application (a function, in the case of serverless). This calls for a remote troubleshooting procedure; as mentioned earlier, production debugging can also be done remotely through remote debugging.
