Logging (WIP!!)

Last modified on August 27, 2019 by Alex


Introduction

Logging is the process of recording events that take place during the execution of a system, an application, or a service. The resulting log can be used to understand the behavior of the system over time and to diagnose problems.

Having good logging is particularly important for systems that run for a long time with little or no user interaction, such as web servers or microservices.

Logs are not just used for diagnosis & monitoring. They can be used to gain insight into other critical aspects of the system such as security, missing pages, failing requests, performance, business activities and audience demographics.

According to studies, in C/C++ projects (the findings might not apply to managed languages with stack traces [3]):

Log messages reduce the median diagnosis time of production-run failures by a factor of 1.4 to 3 (on average a 2.2x speedup) [1].

Purposes of logging

Logging may be used for the purposes of:

  • Debugging
  • System Administration & Dev Ops
  • QA insights
  • Performance monitoring
  • Business Intelligence
  • Security

The most common kind of logging is sometimes called tracing: logging done by developers for debugging purposes. These logs are usually granular, provide very low-level information, and are added in an ad-hoc manner, which makes them noisy. They are often cryptic and unusable by anyone but the responsible developer, and sometimes only on the very day they added them :)

  • How did this unlikely situation happen?
  • What was the root cause of this error?
  • How do we verify that this error has been fixed in the production system?
  • Which silent errors still need to be fixed?
  • Is there a persistent record that allows postmortem analysis as opposed to ephemeral dialogs and final messages in assert statements?
  • Are the intermediate states of our software as we expect them to be?
  • Is there any additional context for an error or a bug report?

Logging for system administration or dev ops purposes is usually called event logging. It records high-level information (e.g. the failed installation of a program). Event logs should be more stable and at least somewhat documented. In practice, they are usually “self-documented”.

  • How many operations are failing?
  • Which specific operations are failing?
  • Is system performance falling behind?
  • Is the system still operational?
  • Which nodes are experiencing more failures?
  • Which nodes need to be restarted or be allocated more resources?
  • Why did the entire distributed system fail overnight?
  • Why is this service not starting up? Which parameters are set incorrectly?
  • Is the service configured correctly?

Logging may also be useful to your QA people when they report bugs or verify whether an application or a service works correctly.

  • Which operations are failing?
  • Is there any additional evidence that confirms a user report?
  • What pages / services / operations should be improved?
  • What new pending tests need to be added?
  • How to prioritize the bug tickets?
  • Which items have missing or inadequate help information?

Performance issues are difficult to analyze, and their root causes are hard to pinpoint. Performance degradation may occur at certain times of the day, during a period of high load, during an active DDoS attack, or for what seems like no reason at all. QA can do performance testing on your application, but such tests rarely represent a realistic production environment with thousands of concurrent users.

By logging performance-sensitive parts of the code and potential bottlenecks, you can retroactively find the root causes of performance degradation. It could be a poorly optimized database query, insufficiently allocated memory, or a CPU spike. Monitoring resource usage over time will also tell you when it is time to upgrade server hardware.

  • Which operations are taking too much time?
  • Which phase or stage of a process is taking more time?
  • Which resources are being unfairly allocated?
  • What is the primary cause of response delays?
  • Which nodes need more resources (CPU / RAM / Disk)?

Logging for business intelligence purposes should record the state of business processes. Some examples: web clicks, financial trades, cell phone connections, audit trails, etc.

  • What is the purchase volume over time?
  • How do purchases compare to last month?
  • How are customers affected by app issues?
  • How is the customer experience?
  • Are transactions taking too long?
  • Where are transactions happening?
  • What is the target demographic?

Developers are good at logging exceptions, but they don’t usually account for security events. Security events can be any suspicious behaviors such as repeated patterns of failed login attempts, ACL changes or new accounts being created.

  • Who is accessing the app? When?
  • What activity looks suspicious?
  • Is the application behaving as expected?
  • When did this user log in and log off? Where did they log in from?
  • Who executed these commands?
  • Which systems, data, applications, files, and networks were accessed?
  • Were there any failed access attempts?
  • Were there any changes to system configurations?
  • Was there any activity from cybersecurity tools like the firewall or antivirus software?
  • Who is responsible for leaking sensitive data?

The sad reality

The sad reality of logging is that most existing systems follow neither “the best practices” nor “YOUR logging practices”. In fact, there are no established guidelines for logging [2]. Unlike feature code, which is at least sometimes tested for correctness, it is very unlikely that anyone tests the correctness of their logging code.

You have little or no control over what is and what is not reported by other people’s software. Most logging messages are written by developers to help them debug their code.

Developers insert logging statements in an ad-hoc manner, producing logs that are completely cryptic to the outsiders (and to the developers themselves after a passage of time), and mostly useless for other purposes.

The good news is that you have full control over the logging information that YOU write, and there are tools that can process the unholy mess produced by others.

How much to log

Our take is that programmers are the creators of software programs, and the latter have nothing to hide from the former. The popular slogan of “hiding implementation details” applies between pieces of code, not to programmers; for them, ideally, all information should be handily available if needed. Therefore, a software program is obligated to reveal its nontrivial execution details with appropriate density in an easy-to-follow way. [7]

How much to log is an age-old question for developers. Logging everything that happens in your application can be great because you have plenty of data to work from when you have a problem.

But logging too much can cause performance issues and result in a waste of disk space. Excessive logging may mask the truly important information.

On the other hand, logging too little or in an ad-hoc manner may result in developers missing important logging statements, which later results in difficulties when debugging issues in production systems.

For every application, you’ll probably find that you’re logging too much data in some areas and not enough in others. The problem is that you do not know beforehand what is important, and therefore it is generally better to err on the “too much” side:

Flying more slowly is better than flying blind.

WHAT to log?

As you first develop your application, start by logging everything at a high level and then add more logging messages as you find yourself asking “What happened here?”

DO log enough context

One fourth (27%) of the log modifications are to variable logging. Majority (62%) of them are adding new variables that are causally related to the log message. It indicates that developers often need additional runtime information to understand how a failure occurred [1].

Suppose you have the following snippet in your code,
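    // Reconstructed sketch of the kind of snippet meant here (all names are
    // assumed; `log` is an slf4j-style logger): the handler reports an error
    // without any information about the message that caused it.
    def handle(msg: Message): Unit =
      try process(msg)
      catch {
        case e: Exception =>
          log.error("Could not process message", e) // which message? what type? what id?
      }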

Trust me, the moment you see this error actually happening, you will want to know the actual message type, message id, etc. Something went wrong, but what? What was the context?

If possible, always log enough context for a complete picture of what happened from a single log event. This is not always possible as the full context may be too large, but when it is possible, it will save you a huge amount of time when things do go wrong.

You should log:

  • All relevant variable values.
  • Timestamps.
  • User or request identity: user ids, order ids, transaction ids, request ids, ip addresses, user-agents.
  • Logical location: host names, node names, service ids.
  • Flow identifiers: correlation ids, span ids.
  • Source code location: file name, function, method, class, line number.
  • Elapsed times.

✓ Relevant variables

Log any variables that are causally related to the log statement. We will later talk about how to do that in a compositional manner.

✓ Timestamps

Timestamps are critical to understanding the sequence of events for debugging, analytics, and deriving transactions.

  • Use the most granular timestamp format available.
  • Always use UTC, especially if your system is distributed. Using different time zones, such as EST, CST, or GMT, in different services will create confusion.
  • If writing semi-structured messages:
    • Place timestamps at the beginning of events.
    • Use a human-readable timestamp format.

Example (good):

08/12/2014:09:16:35.842 GMT INFO key1=value1 key2=value2

✓ User or request identity

Log any unique identifiers that pertain to the current request or user, for example: transaction ids, user ids, request ids, service ids, node ids, ip addresses.

✓ Logical location

When using microservices, it is common to have multiple instances of the same component running at the same time. It is important to have a way to identify a particular instance from the log entry. This can be done by logging a host name, node identifier, process id, and a unique service identifier.

It doesn’t really matter how you generate these ids as long as they are unique and allow you to trace back to the exact server/container and application that generated the entry.

✓ Flow identifiers

Create a correlation identifier that you can trace all the way from the initial request through all subsequent processing, and use it everywhere.

You will need to transport it between services, so if you call out to other microservices, you should pass the correlation id each time. If someone calls your service, and you are passed a correlation id, use it. Return this correlation id to the web browser at the end of the request.

Use an X-Correlation-ID header to include correlation IDs in HTTP requests. Ideally, your internal request API should add these headers transparently, so you don’t have to rely on developers remembering to add the header each time.

Consider augmenting the correlation id as your system processes a request with a short span id: an id generated by each microservice itself, that is passed when it is calling out to other microservices.

Consider using base-62 (or base-58) encoded numbers for ids as they are short and easily recognizable.

Be aware of the birthday paradox when generating random ids.
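A minimal sketch of base-62 id generation (names assumed); note how quickly the birthday bound bites for short ids:

    import scala.util.Random

    // 62 symbols: 0-9, A-Z, a-z.
    val alphabet = ('0' to '9') ++ ('A' to 'Z') ++ ('a' to 'z')

    // 8 characters give 62^8 ~ 2.2e14 possible ids, but by the birthday
    // paradox the collision probability reaches ~50% after only ~1.7e7
    // generated ids; size the id length to your volume.
    def newId(len: Int = 8): String =
      List.fill(len)(alphabet(Random.nextInt(alphabet.length))).mkString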

✓ Source location

Include class, function, and filename locations into your log messages. Consider including stack traces.

✓ Elapsed time

When logging the end of some process, consider adding the elapsed time, as this allows the reader to understand the performance of the system without correlating multiple events.

DO log startup parameters

✓ Configuration

Heuristic H12: If the verb describes initialization of the software or configuration of the software, then the event must be logged. [8]

✓ Application version

When software runs at multiple different locations, versions between them may differ. This means that bugs fixed in one environment may still be present in another.

DO be proactive with debugging

✓ Outgoing requests

If you are working with a distributed system, consider logging every piece of data that goes out of your application and comes in. Integration is a tough job, and diagnosing problems involving multiple services is particularly hard.

Sometimes the amount of information exchanged with external systems makes it unacceptable to log everything. However, during testing, and for short periods in production (for example, when something wrong is happening), we would like to have everything saved in logs, and we are ready to pay the performance cost.

This can be achieved by carefully using logging levels:
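    // Sketch (assuming an slf4j-style logger `log`; all other names are
    // illustrative): one concise INFO entry per exchange is always on, while
    // full payloads go to DEBUG so they can be switched on temporarily.
    log.info(s"Calling $serviceName, request id $requestId")
    if (log.isDebugEnabled) {
      log.debug(s"Request body: $requestBody")
      log.debug(s"Response body: $responseBody")
    }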

✓ Unacceptable performance

It is very tempting (and common) to make assumptions about how your code performs, but those assumptions can blind you to some very common performance issues. For example, a call to your database “should always be fast” so you might be tempted not to instrument it.

FIXME

Log anything outside the range in which you expect your system to perform. If you expect some part of your system to perform at a certain level, you should make sure you log when it falls outside that range. For example, if your database “should always be fast,” then you can log any DB access that takes more than, say, 100 ms. When the inevitable happens and the DB does slow down, you’ll see immediately that this has happened thanks to this log statement.
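A minimal sketch of such a threshold log, assuming an slf4j-style logger and a hypothetical queryUsers call:

    val start = System.nanoTime()
    val users = queryUsers()
    val elapsedMs = (System.nanoTime() - start) / 1000000
    // "Should always be fast": make exceptions to that assumption visible.
    if (elapsedMs > 100)
      log.warn(s"Slow DB access: queryUsers took $elapsedMs ms")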

✓ Paths without sufficient test coverage

As shown earlier in Table 3, 57 failures are hard to detect via generic exception checks. We refer them as hard-to-check errors. When a production failure occurs, it is usually due to an unusual input or environment triggering some code paths that are not covered during in-house testing. Table 6 shows that 21% of the 57 hard-to-check failure cases execute some branch edges that we surmise have never been executed during testing (otherwise, the bugs on those edges would definitely have been exposed). Therefore, if we log on those branch decisions that have not been covered during testing, i.e., cold paths, it would be useful for diagnosis. Of course, special care needs to be taken if some cold paths show up too frequently during runtime.[10]

DO log exceptional conditions

Across 250 randomly sampled reported failures, we first identify that more than half of the failures could not be diagnosed well using existing log data. Surprisingly, we find that majority of these unreported failures are manifested via a common set of generic error patterns (e.g., system call return errors) that, if logged, can significantly ease the diagnosis of these unreported failure cases. [10]

If you encounter a minor problem that can be automatically resolved, log it. A minor hiccup might be a symptom of a bigger problem. Masking the effects of an error by patching it is known in the parlance as a ‘kludge’. It is not a compliment.

If you encounter a major but resolvable problem, you may attempt to resolve it, but make sure to log it ASAP.

  • Your resolution may impair system performance.
  • Your resolution or the original problem itself may result in future problems.
  • The underlying problem may be a usability issue that is annoying to the users.
  • It may be a symptom of a bigger problem.
  • Your resolution may not go as planned, and it would be nice to have a log to refer to.

Log any unresolvable errors.

Please refer to the “compositionality” section on how exactly to do error reporting and recovery correctly.

✓ Log eagerly

Log errors as soon as possible (but don’t overdo it, especially if you are doing FP, as it can affect the compositionality of your code; see below). Sometimes systems fail before an error can be processed up the chain of execution. If you have a log, you will know what happened, even if the condition caused the program to crash on its way back.

✓ Log eagerly (FP)

If you are doing functional programming, you do not have to worry as much about correct error propagation up the chain of execution in pure & total code, but you should still be careful to log errors eagerly (think IO, not WriterT). Consider the following Scala code:
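    import cats.Monad
    import cats.implicits._

    // A plausible reconstruction of the missing snippet (the algebra names are
    // assumed): a program polymorphic in F that logs and then raises an error.
    trait Logging[F[_]] { def log(msg: String): F[Unit] }
    trait Raise[F[_]]   { def raise[A](e: Throwable): F[A] }

    def program[F[_]: Monad](L: Logging[F], R: Raise[F]): F[Unit] =
      for {
        _ <- L.log("about to do something dangerous")
        _ <- R.raise[Unit](new RuntimeException("boom"))
      } yield ()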

Will you see the log output? Well, that depends entirely on the way Monad[F] and log work. If F[A] = WriterT[Either[E, ?], List[L], A] ~ Either[E, (List[L], A)], then you will lose your log messages as soon as something raises an error. EitherT[Writer[List[L], ?], E, A] ~ (List[L], Either[E, A]) will properly propagate log messages.

WriterT[IO, List[L], A] is basically WriterT[Either[Throwable, ?], ...] w.r.t. error handling, so it will lose log messages in a similar fashion. I recommend using the final tagless approach in Scala and supplying Logging[IO] algebras that log eagerly.

✓ Log exceptions with stack traces

Log messages reduce the median diagnosis time of production-run failures by a factor of 1.4 to 3 (on average a 2.2x speedup) [1].

Different from the original study, the median BRT for BWLs [bug reports with logs] is longer than the median BRT [bug resolution time] for BNLs [bug reports without logs] in server-side projects and SC-based [supporting-component] projects. The BRT for BNLs is statistically different from the BRT for the BWLs in nearly half of the studied projects (10/21). However, the effect sizes for BRT between the BNLs and BWLs are small.[3]

One way to interpret the studies on diagnosis time vs the availability of log files is that both log files and stack traces reduce issue resolution time when taken independently, but logs being reported is negatively correlated with the availability of stack traces (since people tend to report logs only when there is no obvious stack trace), which increases the issue resolution time.

Which is to say: make sure that you print exception stack traces into the log files when they are available; that should have a significant effect on your issue resolution times.

✓ Not a replacement for error handling

Logging exceptions and error conditions is one of the most important roles of logging, but many developers tend to treat logging as a way to handle exceptions. They sometimes return a default value (typically null, 0, or an empty string) and pretend that nothing has happened.

There may be some exceptions to this at the top-level of an application, but library code should generally propagate error information outwards.

✓ No catch-log-rethrow [11]

Sometimes developers first log the exception, then wrap it and throw it back:
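    // Illustrative sketch of the anti-pattern (slf4j-style logger assumed):
    try doSomething()
    catch {
      case e: java.io.IOException =>
        log.error("Could not do something", e)                  // logged here...
        throw new RuntimeException("Could not do something", e) // ...and logged again upstream
    }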

Something eventually catches the exception again and logs its cause. As a result, the log file contains the same exception several times at several stack levels.

The problem isn’t the local catch block, which is a sign of good error-handling practice, but the log-and-rethrow part. Either handle the exception, or wrap it with a new exception that adds additional context and throw that.

If you handle the exception, you can easily downgrade the importance of the exception log entry and there is no reason to percolate that exception up the chain. It’s already dealt with.

Handling an exception can include informing users that a problem happened, logging the event, or simply ignoring it.

If you intentionally ignore an exception, I recommend providing a comment in the empty catch clause that clearly indicates why. This lets future maintainers know that it was not a mistake or lazy programming. Example:
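    // Sketch; the exception type and `cache` are assumed.
    try cache.refresh()
    catch {
      case _: CacheUnavailableException =>
        // Intentionally ignored: the cache is a best-effort optimization and
        // the application works correctly, if more slowly, without it.
        ()
    }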


Adding additional context (like the line number and file name in parsing code) is extremely helpful for debugging purposes. Note that re-wrapping an exception in an ApplicationException just to rebrand it doesn’t help you debug. Make sure you add additional information.
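For instance, a parser might wrap a low-level exception like this (a sketch; the exception type and names are illustrative):

    try config.parse(line)
    catch {
      case e: NumberFormatException =>
        // Rebranding alone would not help; the file name and line number do.
        throw new ConfigParseException(s"Malformed value at $fileName:$lineNo", e)
    }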

In this final case, you just allow the exception to leave without touching it. The exception handler at the outermost layer can handle the logging. The finally clause is used to make sure any resources needed by your method are cleaned up, but this is not the place to log that the exception was thrown.

DO log more than debug events

Log anything that can add value when aggregated and/or visualized:

  • User actions.
  • Timing.
  • Transactions.
  • Audit trails.

Log data that is critical for compliance or auditing purposes. Log data that may help you troubleshoot performance problems, solve user-experience issues or monitor security-related events. Avoid logging highly sensitive data, such as credit-card numbers.

Log any data that could be useful to your ops and QA people.

✓ CRUD Actions

Heuristic H1: If the verb [describing an action] involves creating, reading, updating, or deleting resource data in the software system, then the event must be logged.[8]
[…]
If the verb can be accurately rephrased in terms of creating, reading, updating, or deleting resource data in the software system, then the event must be logged.
[…]
The unconstrained natural language used (specifically in use-case based requirements and user guides) may not easily map to the core CRUD actions. For example, “designate” appears in the iTrust use-case based requirements and OCS user guide. In these cases, we attempt to mentally rephrase the action using a core CRUD action. For example, “A patient designates a patient representative” can be mentally rephrased as “create a patient representative in the patient’s list of patient representatives”. [8]

Always record CRUD actions (create, read, update, delete).

✓ Read/View Actions

If the verb implies the system displaying or printing resource data that is capable of being viewed in the user interface or on paper, then the event must be logged.[8]

FIXME

Logging what a user has seen is an extension to logging the data that is communicated between server and client. This logging could be performed by taking screenshots or reproducing a view with the help of a log event. Seeing what a user has seen makes it easier for a developer to determine what exactly went wrong for the user and is especially useful for front-end errors.

Log HTTP request header information: The HTTP request header contains information like the URL and its parameters, the HTTP response code, the HTTP method, etc. Especially the URL and its parameters are important for localizing a bug in the code, because they reveal which method was called by the user.

Log user information: Logging user information is important. Logging the minimal user information, the user ID, is needed to properly query the log events, as well as to trace the activity of a user.

Log data that is communicated between server and client: As an extension to tracing user activity, developers would like to see the data that is sent and received by a user. Developers can use this data to quickly localize, reproduce, and debug a bug, since the information contains the values of variables at the moment a bug occurred.

✓ Actions that Express Permissions

If the verb expresses the granting or revocation of access privileges in the software system, then the event must be logged.[8]
[…]
In “allow doctors to edit immunizations”, the edit action is classified as a mandatory log event. However, the term “allow” suggests the use of an access control security mechanism in the software system. In this example, and based on prior research on security events that should be logged, we consider “allow” equivalent to “grant a user privilege” in an access control mechanism in the software.

✓ User session events

If the verb involves the creation or termination of a user session, then the event must be logged.[8]

Any action that involves the creation or termination of a user session must be logged.

✓ User activity

Trace user activity/path: When there is a bug that is caused by invalid data, the developers are interested in the steps a user took before the bug occurred. There is a chance that a previous step in the user’s process caused the insertion of the invalid data. To fix such a bug, the developers need to alter code in a step earlier in the process than the location of the error. Having the ability to trace user activity, i.e. follow the path the user took through the application, will help the developers in determining which code caused the malformed data.

DO log system metrics

System metrics are a way to monitor the health and performance of a system at any given time, such as CPU usage, RAM usage, and request durations. Mostly these metrics are used to monitor performance in real time and to scale the system up or down when needed. Developers are interested in the performance history when attempting to find the cause of a bug. They will look at the metrics at the time an error occurred to see if there were any anomalies. For example, there could be an unusually high number of requests at the moment an error occurred, which could give an indication of its cause.

HOW to log (formatting)?

DO use structured logging

Developers seldom delete or move logging code, accounting for only 2% of all the log modifications. Almost all (98%) of the log modifications are to the content of the log messages [1].

Most logging “frameworks” force you to use regular strings in your log statements:
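    // Typical string-based statement (slf4j-style logger assumed):
    log.info("User " + userId + " bought " + itemCount + " items for " + total + " USD")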

This is not machine-friendly, and it is difficult to process such messages in an automated fashion without essentially retroactively typing them. You should make your log format structured: machine (and ideally human) readable.

Consider using developer, human, and machine readable formats, such as JSON.
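For example, the purchase event above might be rendered as a single JSON object (field names are illustrative):

    {"time": "2019-08-27T09:16:35.842Z", "level": "INFO", "event": "purchase",
     "userId": "u42", "itemCount": 3, "totalUsd": 59.97}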

✓ If using a text format

If you are adamant about writing text messages, consider using key-value pairs for your message format. They don’t require complex processing and can be parsed with regular expressions:
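    2019-08-27T09:16:35.842Z level=INFO event=purchase userId=u42 itemCount=3 totalUsd=59.97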

Using a text format puts certain restrictions on the types you can put into your logs; avoid printing complex data types unless you can serialize them compactly.

You should avoid using binary data (e.g. base64 encoded) as it is not human-readable and is not easily searchable. If data must be in binary format, place textual metadata in the event. For example, do not log just the binary data of a JPG file, but do log its image size, creation tool, username, camera, GPS location, etc.

✓ Use consistent formatting

Even if you use multiple different formats for different services or applications, you should make sure that your logs get reformatted into a consistent format before they get shipped to a centralized storage. This will simplify the process of analyzing and aggregating information from multiple logs.

DO separate logs with different formats into different files

If a single application has to output events in different formats, separate them into individual files.

If your logs are consumed by multiple different parties, e.g. system administrators and business analysts, it is best to produce a separate log file for each group. They have different needs, and appropriate log files will have different “expiration times”. Support logs might be kept for a week, marketing logs might be rotated every day, audit logs might be kept for years. Keeping all these messages in a single file makes it impossible to archive only audit logs, and will typically make automated parsing much harder.

HOW to log (coding)?

Logging levels

Developers spend significant efforts on adjusting the verbosity level of log messages, accounting for 26% of all the log improvements. Majority (72%) of them reflect the changes in developers’ judgement about the criticality of an error event.
[…]
In 28% of verbosity level modifications, developers reconsider the trade-off between multiple verbosities. It might indicate that developers are often confused when estimating the cost (e.g., excessive messages, overhead) and benefit afforded by each verbosity level [1].

Most logging libraries have a fixed set of severity levels, usually some combination of these:

Fatal/Critical – Overall application or system failure that should be investigated immediately. Yes, wake up the SysAdmin. Since we prefer our SysAdmins alert and well-rested, this severity should be used very infrequently. If it’s happening daily and that’s not a BFD, it has lost its meaning. Typically, a Fatal error occurs only once in the process lifetime, so if the log file is tied to the process, this is typically the last message in the log.

Something terribly wrong has happened and must be investigated immediately. No system can tolerate items logged on this level. Example: NPE, database unavailable, mission-critical use case cannot be continued.

Any error that is forcing a shutdown of the service or application to prevent data loss (or further data loss). I reserve these only for the most heinous errors and situations where there is guaranteed to have been data corruption or loss.

Error – Any error which is fatal to the operation, but not the service or application (can’t open a required file, missing data, etc.). These errors will force user (administrator, or direct user) intervention. These are usually reserved (in my apps) for incorrect connection strings, missing services, etc.

Definitely a problem that should be investigated. The SysAdmin should be notified automatically, but doesn’t need to be dragged out of bed. By filtering a log to look at errors and above, you get an overview of error frequency and can quickly identify the initiating failure that might have resulted in a cascade of additional errors. Tracking error rates versus application usage can yield useful quality metrics such as MTBF, which can be used to assess overall quality. For example, this metric might help inform decisions about whether or not another beta testing cycle is needed before a release.

Warning – The process might be continued, but take extra caution. Anything that can potentially cause application oddities, but from which I am automatically recovering. (Such as switching from a primary to a backup server, retrying an operation, missing secondary data, etc.)

This MIGHT be a problem, or might not. For example, expected transient environmental conditions, such as a short loss of network or database connectivity, should be logged as Warnings, not Errors. Viewing a log filtered to show only warnings and errors may give quick insight into early hints at the root cause of a subsequent error. Warnings should be used sparingly so that they don’t become meaningless. For example, loss of network access should be a warning or even an error in a server application, but might be just an Info in a desktop app designed for occasionally disconnected laptop users.

Warning – Actually, I always wanted to have two levels here: one for obvious problems where a work-around exists (for example: “Current data unavailable, using cached values”) and a second (name it ATTENTION) for potential problems and suggestions. Example: “Application running in development mode” or “Administration console is not secured with a password”. The application can tolerate warning messages, but they should always be justified and examined.

Information – An important business process has finished. In an ideal world, an administrator or advanced user should be able to understand INFO messages and quickly find out what the application is doing. For example, if an application is all about booking airplane tickets, there should be only one INFO statement per ticket, saying “[Who] booked a ticket from [Where] to [Where]”. Another definition of an INFO message: each action that changes the state of the application significantly (database update, external system request).

Generally useful information to log (service start/stop, configuration assumptions, etc). Info I want to always have available but usually don’t care about under normal circumstances. This is my out-of-the-box config level.

This is important information that should be logged under normal conditions such as successful initialization, services starting and stopping or successful completion of significant transactions. Viewing a log showing Info and above should give a quick overview of major state changes in the process providing top-level context for understanding any warnings or errors that also occur. Don’t have too many Info messages. We typically have < 5% Info messages relative to Trace.

Debug – Developers’ stuff. I will discuss later what sort of information deserves to be logged.

Debug is by far the most commonly used severity and should provide context to understand the steps leading up to errors and warnings. Having the right density of Debug messages makes software much more maintainable but requires some diligence because the value of individual Trace statements may change over time as programs evolve. The best way to achieve this is by getting the dev team in the habit of regularly reviewing logs as a standard part of troubleshooting customer reported issues. Encourage the team to prune out Debug messages that no longer provide useful context and to add messages where needed to understand the context of subsequent messages. For example, it is often helpful to log user input such as changing displays or tabs.

Trace – Very detailed information, intended only for development. You might keep trace messages for a short period of time after deploying to the production environment, but treat these log statements as temporary: they should (or might) be turned off eventually. The distinction between DEBUG and TRACE is the most difficult, but if you put in a logging statement and remove it after the feature has been developed and tested, it should probably be on the TRACE level.

Only when I would be “tracing” the code and trying to find one part of a function specifically.

The distinction being that Trace messages are compiled out of Release builds. That said, we discourage the use of Trace messages. Allowing Debug messages tends to lead to more and more Trace messages being added and none ever removed. In time, this makes log files almost useless because it’s too hard to filter signal from noise. That causes devs to stop using the logs, which continues the death spiral. In contrast, constantly pruning Trace messages encourages devs to use them, which results in a virtuous spiral. Also, this eliminates the possibility of bugs introduced by needed side-effects in debug code that isn’t included in the release build. Yeah, I know that shouldn’t happen in good code, but better safe than sorry.

Severity tags?

There is however a logging level that I like to see in my error logs when wearing the hat of a sysadmin as much as that of tech support, or even developer: OPER, for OPERATIONAL messages. This I use for logging a timestamp, the type of operation invoked, the arguments supplied, possibly a (unique) task identifier, and task completion. It’s used when e.g. a standalone task is fired off, something that is a true invocation from within the larger long-running app. It’s the sort of thing I want always logged, no matter whether anything goes wrong or not, so I consider the level of OPER to be higher than FATAL, so you can only turn it off by going to totally silent mode. And it’s much more than mere INFO log data - a log level often abused for spamming logs with minor operational messages of no historical value whatsoever.

As the case dictates this information may be directed to a separate invocation log, or may be obtained by filtering it out of a large log recording more information. But it’s always needed, as historical info, to know what was being done - without descending to the level of AUDIT, another totally separate log level that has nothing to do with malfunctions or system operation, doesn’t really fit within the above levels (as it needs its own control switch, not a severity classification) and which definitely needs its own separate log file.

The list above is just a suggestion; you can create your own set of conventions to follow, but it is important to have some. In my experience, everything always gets logged without filtering (at least from the application code), but having the ability to quickly filter logs and extract information with the proper level of detail can be a life-saver.

So now I’m in the process of switching to logging based on a flag system. Everything that gets logged has a flag detailing what kind of operation it is, and there’s a set of checkboxes allowing me to define what gets logged (flags such as ERROR, BASIC, STATE_CHANGE and EXCEPTION).

The software typically ships with ERROR, BASIC, STATE_CHANGE and EXCEPTION turned on, but this can be changed in the field via the debug dialog (or a registry/ini/cfg setting, where these things get saved).

Log levels are non-compositional; see “Technical considerations - Logging vs compositionality” below.

DO NOT duplicate error level in the message

Consider the (reconstructed) statement below. Although the verbosity level is set to INFO, the text suggests that the log statement is used for debugging purposes. The fix is to first change the logging level to DEBUG:
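    // Reconstructed illustration (slf4j-style logger assumed):
    log.info("DEBUG: loaded " + users.size + " users")    // level says INFO, text says DEBUG

    // Step 1: use the level that the message itself asks for.
    log.debug("DEBUG: loaded " + users.size + " users")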

Now we have duplication of information about severity level, which we can fix as well:
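    log.debug("Loaded " + users.size + " users")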

MIND complex expressions in log statements in unsafe languages

Every time you write (or review) a log statement, check if there are any potential NullPointerExceptions like:
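    // If `order` is null, building the message throws before anything is logged.
    log.debug("Processing order with id " + order.getId)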

or unsafe casts like
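    // Throws ClassCastException at the log site if `result` is not a Collection.
    log.debug("Found " + result.asInstanceOf[java.util.Collection[_]].size + " items")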

Note that
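    // A defensive in-line null check (reconstructed example): it avoids the
    // NPE, but clutters the code and silently hides the real problem.
    log.debug("Processing order with id " + (if (order != null) order.getId else "null"))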

is not a good solution either.

DO NOT put side-effects into your log statements

Be careful with side-effects in your logging statements. If you fetch a “collection” of domain objects from the database using Hibernate and carelessly log them:
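    // Sketch: `users` may be a lazily-initialized Hibernate collection.
    val users = session.createQuery("from User").list()
    log.debug("Loaded users: " + users)   // toString() may load the entire result set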

You will be inadvertently calling toString() on this “collection”, which might result in a variety of problems, such as OOM, the N+1 select problem, thread starvation (logging is synchronous!), lazy initialization exceptions, etc.

DO NOT use multiple statements for one logical event

Each logging statement should represent a single execution event.
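    // Reconstructed counter-example: one logical event, three statements
    // (their output can interleave with other threads’ messages):
    log.debug("Processing order")
    log.debug("Order id: " + orderId)
    log.debug("Customer: " + customerId)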

should be rewritten as:
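    log.debug(s"Processing order $orderId for customer $customerId")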

DO be lazy

In log4j, use the is<Level>Enabled guard methods (e.g. isDebugEnabled), or the similar methods in other logging frameworks.

Since Java before Java 8 does not have any form of lazy evaluation, any expression passed to the logger methods is evaluated even if the logger just ignores the String that is created. This is wasteful when logging things like complex data structures.

Thus, code like the following should be guarded:
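    // Without a guard, the expensive summary is computed even when DEBUG is off:
    log.debug("World state: " + expensiveSummary(world))

    // Guarded version: pay the cost only when the level is enabled.
    if (log.isDebugEnabled)
      log.debug("World state: " + expensiveSummary(world))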

In Java 8, the built-in java.util.logging Logger class supports lambdas through a Supplier, evaluating the expression to log only if the log level is met.
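For example (a sketch; `expensiveSummary` is assumed, and a Scala lambda converts to java.util.function.Supplier):

    val jul = java.util.logging.Logger.getLogger("app")
    // The supplier is evaluated only if level FINE is enabled.
    jul.fine(() => "World state: " + expensiveSummary(world))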

Log4j 2 also supports lambda-based lazy evaluation of log arguments; check whether your framework of choice offers something similar.

DO log before an action

A common example is logging a SQL command after its execution. In case of an exception, the log entry might not be written, so the audit log will not provide relevant contextual information for the exception. Audit log entries should be written before an action, not after.
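A minimal sketch of the before-the-action ordering (names assumed):

    // The audit entry exists even if execute() throws.
    log.info(s"Executing SQL: $sql")
    statement.execute(sql)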

HOW to log (operational)?

Implement a log security and retention policy

Logs contain sensitive data. A log security policy should cover sensitive data, like the personal data of your clients or internal access keys for APIs. Make sure that sensitive data gets anonymized or encrypted before you ship logs to any third party. GDPR log management best practices teach you about good data protection practices for sensitive and personal data in web server logs. Secure transport of log data to log management servers requires setting up encrypted endpoints for TLS or HTTPS on both the client and server side.

Logs from different sources might require different retention times. Some applications are only relevant for troubleshooting for a few days. Security-related or business transaction logs require longer retention times. Therefore, a retention policy should be flexible, depending on the log source.

Log Storage

Planning the capacity for log storage should consider high load peaks. When systems run well, the amount of data produced per day is nearly constant and depends mainly on the system utilization and amount of transactions per day. In the case of critical system errors, we typically see accelerated growth in the log volume. If the log storage hits storage limits, you lose the latest logs, which are essential to fix system errors. The log storage must work as a cyclic buffer, which deletes the oldest data first before any storage limit is applied.

Design your log storage so that it’s scalable and reliable – there is nothing worse than having system downtimes and a lack of information for troubleshooting, which in turn, can elongate downtime.

Log storage should have a separate security policy. Every attacker will try to avoid leaving traces in log files, or to delete them. Therefore, you should ship logs in real time to central log storage. If an attacker has access to your infrastructure, sending logs off-site, e.g. to a logging SaaS, will help keep the evidence untampered.

Review & constantly maintain your logs

Unmaintained log data could lead to longer troubleshooting times, risks of exposing sensitive data or higher costs for log storage. Review the log output of your applications and adjust it to your needs. Reviews should cover usability, operational and security aspects.

Repeating log messages

I call it flooding when I see multiple similar messages coming from a loop with many iterations. To me, flooding is annoying enough to try to get rid of it when I detect it in source code. Improving it usually requires some art, because, well, things that happen within the loop may be interesting. When I don’t have time to improve it properly, I try to at least change the logging level of such messages to the lowest one, to make them easier to filter out.

To avoid flooding, I usually gather statistics in the loop and output them after the loop. Meaning: anything interesting happening in the loop should be recorded in a variable (like somethingSpecialHappenedCount) and then output to the logger, as in the sketch below.
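    // Sketch (names assumed): count inside the loop, log one summary after it.
    var failedCount = 0
    for (item <- items) {
      if (!process(item)) failedCount += 1
    }
    if (failedCount > 0)
      log.warn(s"$failedCount of ${items.size} items failed to process")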

DO NOT commit meaningless messages

Senseless messages seem to be pretty popular garbage. These look harmless when read in source code - I guess one has to go through the pain of analyzing debug output looking like…

…to deeply appreciate their inherent ugliness.

My favorite heuristic for detecting this kind of problem at the source code level (proposed by a colleague on one of my past projects) is to count the occurrences of the space character in the string literals used in logging. In my experience, zero spaces basically guarantees that the logging statement is senseless; one space is also a good indicator of a potential issue.

Operational Best Practices

  • Log locally to log files: they provide a local buffer and are non-blocking during network failures. Use syslog-ng or rsyslog.

  • Implement rotation policies: logs take up space, and many compliance regulations require years of archival storage. Decide on destroying or backing up logs.

  • Collect events from everything, everywhere: application logs, database logs, network data, configuration files, performance data, time-based data.

  • More data captured = more visibility.

  • Shield users from important messages that do not pertain to them.

Send Logs to a Centralized Location

Let’s assume that you’re already adding all sorts of useful information to your logs. But it’s also essential to send logs to a centralized location. Think about it: if you have to log in to each individual server to read logs, you’ll spend more time trying to correlate problems than if you had one location where you could access all logs. Also, systems usually get more complicated as time goes by, so the number of microservices usually grows too. And to make things even more complicated, services may be hosted on different servers or providers.

Don’t do log analysis in a silo: Correlate all data sources

Connect the dots. Logging is one part of an entire monitoring strategy. To practice truly effective monitoring, you need to complement your logging with other types of monitoring like monitoring based on events, alerts and tracing. This is the only way to get the whole story of what’s happening at any point in time. Logs are great for giving you high-definition detail on issues, but this is useful only once you’ve seen the forest and are ready to zoom into the trees. Metrics and events at an aggregate level may be more effective, especially when starting to troubleshoot an issue.

Don’t look at logs in a silo — Complement them with other types of monitoring like APM, network monitoring, infrastructure monitoring, and more. See APM vs. Log Management for more detail. This also means that the monitoring solution you use should be comprehensive enough to provide all your monitoring information in one place, or flexible enough to easily integrate with other tools that provide this information. This way, as a user, you have a single-pane view of your entire stack.

Consider splitting logs (locally)

E.g. in a recent project of mine, the QA people asked for dedicated files for metrics and timing data and for brief reports on system operations. They said they would benefit from that, and dev did it (the benefit from the brief-reports file indeed turned out to be significant).

Make log level configurable

Some applications logs are too verbose and other application logs don’t provide enough information about the activities. Adjustable log levels are the key to configure the verbosity of logs. Another topic for log reviews is the challenge to balance between logging relevant information and not exposing personal data or security-related information. If so, make sure that those messages can be anonymized or encrypted.

Inspect audit logs frequently

Acting on security issues is crucial, so you should always have an eye on audit logs. Set up security tools such as auditd or OSSEC agents. These tools implement real-time log analysis and generate alert logs pointing to potential security issues. On top of such audit logs, you should define alerts in order to be notified quickly of any suspicious activity. For more details, check out a quick tutorial on using auditd; you’ll find some complementary frameworks there too.

In production you’ll probably log info and above, but if you have problems in specific components then you should be able to change the tracing to debug or verbose in order to capture the required diagnostic information.

Set Alerts

Logs are often used as a source of historical data, but they can also be used to identify ongoing problems and predict potential issues. If a service fails or becomes unresponsive, you’ll want to be notified immediately.

Use logging to identify automation opportunities and trends. Logging helps to catch issues early in the pipeline and saves your team valuable time and energy. It also helps you find opportunities for automation. You can set up custom alerts to trigger when something breaks, and even set up automated actions to be initiated when these alerts are triggered. Whether it’s through Slack, a custom script, or a Jenkins automation plugin, you can drive automation in your GitOps process using logs. For all these reasons, you should view logging as an enabler and driver of GitOps rather than an add-on.

Most applications log exceptions so that developers can look back into errors and provide updates. But not every exception is created equally. The most critical events are those that impact business revenue. These events are the ones you should monitor closely and send alerts to the right response team.

It shouldn’t be your customers calling to tell you that there are bugs in your application. You want to find them before they do, and many customers will just bounce to another vendor if you have too many bugs. Many of your customers won’t report application issues at all, so you could be losing sales every hour and never know it.

When you build a shopping cart, you should have a point where you ask a visitor for a way to contact them. Usually, email is a common input field during account creation. With an alert on failed shopping cart activity, you have a way to contact a customer should they receive an error and bail on the shopping experience. Alerts are a great tool to salvage a lost customer due to application bugs.

But you also need alerts to tell you when components of your application are creating obstacles for customer shopping experiences. It could be an alert for performance (similar to the previous example), or your customers could be dropping at a specific point in the sales funnel. Alerts give you insights into the efficacy of your marketing and user experience. Use them generously to identify issues with your sales pages and find solutions for a better user experience.

They also help you pinpoint components of your application that could be causing performance degradation. Your developers would need to set a baseline, but — for instance — you could set an alert for any event that takes longer than 100 milliseconds to process. When you see a pattern, you can then have developers research more into these application components for better optimization.

When CPU usage spikes to over 80%, set an alert to inform administrators. It could be something as simple as upgrading RAM or even your server’s CPU, but having these log alerts will give you the ability to analyze the time of day and any patterns surrounding application procedures.

Other considerations

Visualization

Trends tell the story better than data points.

IP protection

In proprietary software, tracing data may include sensitive information about the product’s source code.

Anonymization

[9]

  • Generalization: Since we do not know the identifiers and quasi-identifiers, everything would need to be generalized. This has a large impact on the utility of the data, let alone defining a generalization script for every type and subtype of data.

  • Suppression: Suppression means removing identifiers and quasi-identifiers from the data, but since the system does not know which fields are identifying, this technique cannot be applied.

  • Perturbation: This technique does not apply, since we do not have any statistical data, nor are we interested in the statistical value of the data.

  • Anatomization: Anatomization aims to split the (quasi-)identifiers of a dataset and the rest of the dataset into two separate datasets. However, we do not know the (quasi-)identifiers, so this technique is disregarded.

Therefore, the chosen technique is permutation. Permutating every piece of data by default is a best practice in anonymization, in order to minimize the chance of releasing any privacy-sensitive information. The requirements do leave some data types un-permutated (booleans and NULL values) or less-permutated (strings, and cases where developers use ranges). Therefore, the privacy risks that come with the chosen anonymization method will be discussed in detail in this section.

What should be done with non-alphanumeric characters in strings? One of the examples had a long string in the data with characters such as ß or %&$. The developers agreed that characters from languages with other alphabets should be anonymized, since they can also contain sensitive information. Bugs related to non-Roman characters are becoming increasingly rare, since current applications can handle them out of the box. The developers reasoned that non-alphanumeric characters should be preserved, since they sometimes have special meanings in strings.

What should be done with dates during anonymization? What should be done with filenames during anonymization? Is there a difference between the anonymization of named entities and unstructured text? Should the structure of the string be preserved (length, spaces, casing, separators)?

What should be done with JSON during anonymization? An example was taken from a real system log, where there was a JSON string in the log data. The developers came to a quick conclusion: the anonymization tool should work on JSON data, so the tool should decode the JSON, recursively anonymize it, and replace the existing JSON string with a string representation of the anonymized JSON.

What should be done with dates during anonymization? After discussing for some time, the developers concluded that dates can be quasi-identifiers. One person noted that there is a difference between standalone dates and dates inside text, while the two are both represented by strings when dealing with JSON. They agreed that standalone dates should be preserved, since they are parsed server-side and could cause bugs, while dates inside text do not need to be preserved, as they are not parsed and therefore not sensitive to faults.

What should be done with filenames during anonymization? In some parts of the system, users upload files and the filename they upload is preserved, which means that a filename can occur in log data. A filename can include privacy-sensitive data such as a company name. The developers agreed that the same anonymization method as for normal strings should be used for this, but that the file extension should be preserved, because a file extension is not privacy-sensitive while it can give information about the cause of an error.

What should be done with numbers during anonymization? The developers were considering the different uses of a number. When a number is used as an ID for an object, it is not privacy-sensitive, since the developers have no access to the production database. Even if a developer has access to that production database, using IDs from anonymized logs is probably the least of the company’s problems, since they actually have a data leak (a developer having access to the production database) at that point. Numbers are also used for enum types. An enum in C# is represented as an integer in JSON. The developers agreed it would be useful to see the real enum values, since they are often used in if-statements and queries. For the rest of the uses (personal numbers, amounts, percentages, etc.), the developers thought it would be best to randomize them while preserving the length of the number.

Are outliers of any interest? Before thinking about whether outliers are useful, the developers concluded that it would not be possible to detect outliers reliably. For example, the anonymization tool cannot know if a date with value 2100-01-01 is an outlier for a particular field. The developers agreed that outliers could be an indication of an error, but not anonymizing them would be vulnerable to privacy leaks.

IDs of objects are not privacy-sensitive. Most enums will not be privacy-sensitive, but some will be. Regardless of whether those fields are privacy-sensitive or not, detecting whether a number is an ID or an enum is a problem on its own. While it would be possible, as the developers suggested, to record which fields are enums in log data, such a technique is not possible for IDs, since they are just denoted as integers. Furthermore, such an enum detection method would cost much time to implement. Name recognition could be used to detect fields with “ID” in their names, but that will not cover all fields that are IDs or give […]

Standalone dates should be preserved, according to the developers. However, dates can be strong quasi-identifiers, e.g. a date of birth, which means that standalone dates should be anonymized.

Preserving string length plus all white space and non-alphanumeric characters in strings can be a privacy risk when a creditor can be identified by a unique combination of length and positions of these preserved characters. A risk assessment will be conducted to determine the privacy impact of preserving string length and these characters.

Whether to disclose a boolean depends on its distribution. For even distributions, the boolean can be disclosed safely. However, it is not always possible to reliably determine a distribution during anonymization, since log data for a certain model or view can be scarce. Therefore, two options remain: randomize the boolean value or disclose the original value. Naturally, a randomized value does have a negative impact on utility, while disclosing the original value can possibly have an impact on privacy, but only if booleans by themselves can be privacy-sensitive. A risk assessment will be conducted to determine the privacy impact of disclosing booleans.

Technical considerations - Function purity

Technical considerations - Logging vs compositionality

My main problem with logging levels is that they are non-compositional: only the caller knows whether a failure in a function is a “critical failure” in the context of the surrounding computation.

If you’re a new developer working on this project for the first time, or a client who is testing my software, it is reasonable to think that something terrible has happened. Since the log file looks so concerning, you might spend an hour or two investigating these errors before realizing that these I2C write failures are expected and acceptable. This particular program supports dynamic detection of connected sensors using an I2C write. If the sensor is not connected, the write will fail and we will move on to the next slot. Even worse, the same I2C error spew occurs if you try to run an I2C bus sweep to detect connected devices. The log becomes practically worthless in such a situation, as each I2C sweep call would generate up to 128 I2C error prints. The obvious answer is to disable the log statement, but the underlying I2C layer was provided to us as a pre-compiled library. This example is hardly a shining beacon of useful error logging, but is indicative of typical logging experiences.

Technical considerations - Fixing the set of errors

It’s going to be almost impossible to have a defined format for log data; some logs might need more fields than others, and those that don’t need all those excess fields will be wasting bytes.

Because software tracing is low-level, there are often many more types of messages that would need to be defined, many of which would only be used at one place in the code. The event-code paradigm introduces significant development overhead for these “one-shot” messages.

The types of messages that are logged are often less stable through the development cycle than for event logging.

Localization

Because the tracing output is intended to be consumed by the developer, the messages don’t need to be localized. Keeping tracing messages separate from other resources that need to be localized (such as event messages) is therefore important.

Technical considerations - Log stability

New messages can be added often and there are no localization concerns. A standards-based output format may be desirable.

References