The Art of Troubleshooting

While this post is not strictly related to InfoSec, it does come into play in some form. Often I see people asking for help when it is obvious that they haven't performed basic troubleshooting (anything from missing the actual issue being reported in the logs, to not testing connectivity, etc). The purpose of this post is to introduce some basic steps I follow when I encounter issues with the systems I deal with.

For this I will primarily use the example where someone reports to you that users are unable to log into your system.

Step 1 - Logs

This should ideally be your first step. More often than not, details of the issue will be present in the logs. This includes anything from stack traces to error messages or even informational messages.

If a stack trace is present, make sure that you start at the bottom of the stack trace. Sometimes the topmost exception may not give enough information. The issue may have been triggered by a lower-level exception, which should show further down in the log entry for the stack trace.
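As a rough illustration, assuming a Java-style log where nested exceptions are introduced by `Caused by:` lines, the lowest-level cause can be pulled out with standard tools. The function name `root_cause` and the log path in the example are hypothetical, and other languages format their traces differently.

```shell
# Print the last "Caused by:" line found in a log file.
# Assumption: Java-style traces, where each nested exception is introduced
# by "Caused by:" and the lowest-level one appears last.
# If the file holds several traces, this returns the cause of the latest one.
root_cause() {
    grep 'Caused by:' "$1" | tail -n 1
}

# Example (hypothetical path):
#   root_cause /var/log/myapp/app.log
```

This only narrows down where to start reading. The frames underneath that line are still worth a look.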

If the error is still not immediately obvious, try to trace through some of the logic to ensure that things are working as they should. In our example, are all users affected or only some? This can likely be confirmed by checking the logs to see whether any users were able to authenticate successfully.
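To make the "all users or only some?" question concrete, here is a minimal sketch. It assumes a made-up log format in which each authentication attempt is logged on one line containing either `LOGIN_SUCCESS` or `LOGIN_FAILURE` followed by `user=<name>`. Your system's log format will differ, so treat the patterns and the function names `login_summary` and `failed_users` as placeholders.

```shell
# Count successful and failed logins in a log file.
# Assumption: lines contain the literal markers LOGIN_SUCCESS / LOGIN_FAILURE.
login_summary() {
    awk '/LOGIN_SUCCESS/ { ok++ }
         /LOGIN_FAILURE/ { fail++ }
         END { printf "success=%d failure=%d\n", ok, fail }' "$1"
}

# List the distinct users with at least one failed login.
# Assumption: failure lines carry a "user=<name>" field.
failed_users() {
    sed -n 's/.*LOGIN_FAILURE.*user=\([^ ]*\).*/\1/p' "$1" | sort -u
}

# Example (hypothetical path):
#   login_summary /var/log/myapp/auth.log
#   failed_users /var/log/myapp/auth.log
```

If the success count is zero, the problem is probably system-wide. If only a handful of users show up in the failure list, look at what those accounts have in common.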

Step 2 - Connectivity

So nothing appears in the logs. The next step is to confirm that there is appropriate connectivity to the external systems or services used by your server/service. In our example, your server may use a database. Ensure that the server has connectivity to the database and that the configured credentials are still correct and valid.

Some tools which you could use include the following.

Netcat

The "swiss army knife" of networking tools. You can test connectivity using the following command:

nc -vv <hostname/IP> <port>

Where:

hostname/IP is the hostname or IP address of the system which you want to test connectivity to.
port is the port number of the service running on that system.
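When a service depends on several external systems, the same check can be looped over a list. The sketch below assumes a netcat variant that supports `-z` (test the port without sending data) and `-w` (timeout in seconds). Both are common, but behaviour varies slightly between netcat implementations. The function name `check_endpoints` and the hosts in the example are hypothetical.

```shell
# Probe each "host port" pair read from standard input with netcat.
#   -z    only test whether the port accepts connections; send no data
#   -w 3  give up after 3 seconds instead of hanging
check_endpoints() {
    while read -r host port; do
        # </dev/null stops nc from consuming the list we are reading from
        if nc -z -w 3 "$host" "$port" </dev/null >/dev/null 2>&1; then
            echo "OK   $host:$port"
        else
            echo "FAIL $host:$port"
        fi
    done
}

# Example (hypothetical hosts):
#   printf '%s\n' 'db.internal 5432' 'ldap.internal 636' | check_endpoints
```

A FAIL here only says the TCP connection could not be made from this machine. The cause could be a firewall, a stopped service, or DNS.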

OpenSSL

Often there could be an issue with the certificate configuration or even the TLS configuration. OpenSSL is a fantastic tool to help with troubleshooting:

openssl s_client -connect <hostname/IP>:<port> < /dev/null

Where:

hostname/IP is the hostname or IP address of the system which you want to test connectivity to. If the service is using SNI you will need to include the argument `-servername` followed by the hostname of the service.
port is the port number of the service running on that system.
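Expired or mismatched certificates are a common culprit, so it can help to print just the certificate details rather than reading the full `s_client` output. This is a minimal sketch that assumes the target is given as a hostname (so it can double as the SNI name) and defaults to port 443 when none is supplied. The function name `cert_info` and the host in the example are hypothetical.

```shell
# Print subject, issuer and validity dates of the certificate a server presents.
#   -servername  sends the hostname via SNI
#   x509 -noout  suppresses the encoded certificate and prints only the
#                requested fields
cert_info() {
    host=$1
    port=${2:-443}   # assumption: default to 443 when no port is given
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
        openssl x509 -noout -subject -issuer -dates
}

# Example (hypothetical host):
#   cert_info login.example.com 8443
```

Check that the `notAfter` date is still in the future and that the subject matches the hostname your clients actually connect to.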

Step 3 - Google

Search for the stack trace or any error messages. Search for the symptoms. 9 times out of 10 (OK, I made that number up, but it's likely not far off), someone else has faced the same issue that you have. Searching for it and getting answers from places such as Stack Overflow can literally save you hours of frustration.

Step 4 - Ask

If your search doesn't turn up an answer, the next step is to leverage social media such as Twitter and see if anyone else has encountered the same issue. They might even be able to help you out.

Step 5 - Support

If you have a support contract, leverage that, especially if you are paying for it. It's up to you to decide whether this should in fact be your first step. Personally, I leave this as a last resort since I prefer to fix things myself. By doing so I often find that I learn more about the system, which helps in the future. There was once a case where I was working for a client, and I spent so much time troubleshooting issues that I had a 3rd party asking me for help with their own integration with the system I was supporting!

Conclusion

As I think of more things to add, I shall update this post. But hopefully this will provide a suitable start and be helpful to those who take the time to read it. Any feedback and comments are most welcome.