Troubleshooting at all layers of the OSI model

I saw this phrase in a job description last week: Troubleshooting at all layers of the OSI model. That sounds a bit intimidating, right?

Maybe at first. But let’s not overcomplicate it. Once you get past the terminology, it’s a logical way to locate and fix problems. Chances are you already do most of this whether you realize it or not. I was already covering at least four of the seven layers when I worked as a part-time desktop support technician in college in 1995.

Layer 1, physical. Check your network cables, network cards, device drivers, link speed and duplex settings, and make sure all is well with them. Many problems end right here.

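Layer 2, data link. Use the arp command (arp -a lists the cache) to confirm that IP addresses are resolving to the MAC addresses you expect.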
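
If you want to compare ARP entries against a list of known hardware, something like this rough sketch can turn pasted arp -a output into a lookup table. The output format varies by operating system, so the patterns here are my assumptions about the common Windows and Unix-style layouts, and the function names are made up for illustration.

```python
import re
from typing import Dict

# Assumed patterns: dotted IPv4 addresses, and MACs written with either
# hyphens (Windows style) or colons (Unix style), with or without leading zeros.
IP_PATTERN = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
MAC_PATTERN = re.compile(r"\b([0-9A-Fa-f]{1,2}(?:[:-][0-9A-Fa-f]{1,2}){5})\b")


def normalize_mac(mac: str) -> str:
    """Return the MAC as lowercase, zero-padded, colon-separated text."""
    parts = re.split(r"[:-]", mac)
    return ":".join(part.zfill(2).lower() for part in parts)


def parse_arp_table(text: str) -> Dict[str, str]:
    """Map each IP address to its MAC, skipping lines that lack either one."""
    table = {}
    for line in text.splitlines():
        ip_match = IP_PATTERN.search(line)
        mac_match = MAC_PATTERN.search(line)
        if ip_match and mac_match:
            table[ip_match.group(1)] = normalize_mac(mac_match.group(1))
    return table
```

Paste the command output into a string, pass it to parse_arp_table, and check whether the gateway’s IP maps to the MAC printed on the device label.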

Layer 3, network. This is ping, tracert (traceroute on Unix-like systems), route print (to see the routing table), and looking at IP packet headers with a sniffer.
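
When there are a lot of hosts to check, the ping step is easy to script. Here is a minimal sketch that shells out to the system ping command. The function names are my own, and I’m assuming ping is on the path and takes -n for the count on Windows and -c everywhere else. Treat the exit code as a rough signal only, since some ping implementations report success on replies that are really errors.

```python
import platform
import subprocess
from typing import List, Optional


def build_ping_command(host: str, count: int = 2,
                       system: Optional[str] = None) -> List[str]:
    """Build the ping argument list for the given (or current) OS."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host: str, count: int = 2) -> bool:
    """Return True if the system ping command exits with status 0."""
    try:
        result = subprocess.run(
            build_ping_command(host, count),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ping is missing, not runnable, or took too long
        return False
    return result.returncode == 0
```

Calling is_reachable on the default gateway first and then on a host beyond it is a quick way to tell a local problem from a routing problem.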

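Layer 4, transport. Look at the TCP/UDP headers with a sniffer. For TCP, make sure the sequence numbers are in the proper order. UDP carries no sequence numbers, so there you can only confirm that datagrams are arriving on the expected ports.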
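
To make the idea of checking TCP ordering concrete, here is a small sketch that takes sequence numbers and payload lengths copied out of a capture, for one direction of one connection, and flags anything that doesn’t line up. It’s a simplification of my own: it ignores sequence number wraparound and the extra sequence number that SYN and FIN consume, and the function name is hypothetical.

```python
from typing import List, Tuple


def find_sequence_problems(segments: List[Tuple[int, int]]) -> List[str]:
    """Check (sequence_number, payload_length) pairs in capture order.

    Reports gaps (data missing before this segment) and segments that
    start before the point already seen (retransmitted or out of order).
    """
    problems: List[str] = []
    if not segments:
        return problems
    expected = segments[0][0]  # next sequence number we expect to see
    for index, (seq, length) in enumerate(segments):
        if seq > expected:
            problems.append(
                f"segment {index}: gap, expected seq {expected}, got {seq}")
        elif seq < expected:
            problems.append(
                f"segment {index}: retransmitted or out of order, "
                f"expected seq {expected}, got {seq}")
        # Track the highest byte seen so far
        expected = max(expected, seq + length)
    return problems
```

An empty result means every segment started exactly where the previous one ended. A sniffer will usually flag these conditions for you, so this is mainly to show what it’s looking for.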

Layer 5, session. This is protocol-specific. In Windows, it’s the NetBIOS layer–nbtstat is your friend. For other protocols, look at the relevant logs, check to see that the service or daemon is running and the port is open, telnet into the relevant port, things like that.
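
The telnet-to-the-port test is easy to reproduce in a script on machines without a telnet client. This is a minimal sketch using Python’s standard socket module. The function name is my own, and a successful connection only shows that something is listening on the port, not that the service behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False
```

For example, port_is_open("mailserver.example.com", 25) does the same job as telnetting to port 25 on that (made-up) host.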

Layer 6, presentation. Is the data in the right format? Check relevant files and/or data streams to make sure they’re not corrupt. You may have to break out the hex editor, though I’ve been known to use notepad for quick and dirty tests–just don’t save the file after looking at it.
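
If there’s no hex editor handy, a few lines of Python will show the first bytes of a file without any risk of saving over it, since the file is opened read-only. This is just a sketch, and the function names are my own.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII per line."""
    pad = width * 3 - 1  # width of a full row of hex pairs with spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        text_part = "".join(
            chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(lines)


def peek_file(path: str, length: int = 64) -> str:
    """Hex-dump the first `length` bytes of a file, opened read-only."""
    with open(path, "rb") as handle:
        return hex_dump(handle.read(length))
```

The first few bytes are often enough to tell whether a file is what its extension claims. A ZIP archive, for instance, normally starts with the letters PK.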

Layer 7, application. This usually involves installing missing patches, reinstalling patches, and potentially even uninstalling and reinstalling components.

So, what is there to like about troubleshooting this way?

If nothing else, it ensures that all possibilities get covered consistently. It’s not all that hard to commit to memory, and those who don’t have it memorized yet can scrawl it on the back of a business card, keep it in a wallet, and refer to it as necessary.

Historically, I’ve done less at layer 4 than anywhere else because I’ve worked places that are very averse to network sniffers, but sniffers have their place. One could argue about the order to do layers 2-6 in, though I can’t think of a good argument against starting with cables and ending with application software.

Once you recognize a recurring pattern of behavior, you may deviate from that order. As an example, I once inherited a server. A consultant had built it years earlier, and he was long gone. The installation software was long gone too, and the product was discontinued, so there was no support available from the vendor. The .NET Framework would die for no apparent reason every few months, and the server would stop. The solution was to reinstall all of the .NET hotfixes and reboot. So when that server died, I skipped all the way to layer 7, because I knew I wasn’t going to find anything on layers 1-6.

The right way to fix it was to tear the server down and build it right so that it wouldn’t break again, but that wasn’t an option in this case. This kind of thing shouldn’t happen, but most shops have at least one story like this. So when rebuilding isn’t an option and replacing isn’t either, you follow a standard troubleshooting procedure to find a fix, and then you might develop a specific procedure for that particular issue if it keeps recurring.

So what’s the point of having this in a job description? It helps you differentiate between a security person who spent an entire career on the policy side and one with hands-on experience in the trenches. There’s a time and place for both types of people, but they aren’t necessarily interchangeable.
