My logging system died rather abruptly one week. It started when the Active Directory account some of our servers use got locked out. I got the account unlocked (someone else has those rights) and the system came back to life for a while, but then we had to repeat the process, and each time we repeated it, “a while” grew shorter, bottoming out at about 2 minutes, 40 seconds.
The way you troubleshoot problems like this is by looking at logs. The problem is, you can’t collect very many logs in 2 minutes and 40 seconds.
The logs I did have implicated the collectors. These are servers that reach out to the other Windows servers and pull in the logs and forward them on to centralized log servers for storage and search. Random servers were reporting incorrect user name or password events, attributed to the IP address of one of our busier collectors.
I looked, and, indeed, that collector was trying to collect from those servers. So I disabled the server, then ran a test with the account unlocked. It worked. I re-entered the credentials anyway, because doing this kind of thing for too long makes you paranoid. It worked. So I re-enabled the server, and, minutes later, the account locked.
The error messages reappeared. My collector was fat-fingering the password or something. Computers don’t fat-finger passwords. People do.
So, since I had a computer that was obviously on drugs, I checked the uptime. Way too long. That’s the thing about Windows these days–not all patches require reboots anymore, so you can get insane uptimes again with a little luck. And sometimes that’s not a good thing. Like in this case. So I rebooted all of the collectors. The collectors thanked me by promptly locking the account.
So, the next time I saw our red-team guy, I talked to him. Maybe he was doing something with that account? And maybe if I asked nicely, he would stop?
“Yeah, I saw that account’s been locked for a long time. I was wondering when someone was going to unlock it.”
I told him the story.
“I’ve seen that before. You’ll see events about a wrong password, but those events fire on the second time the account was accessed, and you zero in on that one and never get anywhere, like you’re chasing a ghost.”
That turned out to be the hint I needed, even though it still took a little while. Some of that wasn’t my fault.
With the vendor on the phone, I shut down all of the collectors. We looked at a few other things, including the log collectors’ own plaintext logs, then the vendor asked me a question.
“We have to get a look at the domain controller logs,” he said. “How are we going to get those when that account keeps locking?”
Well, by lucky break or brilliant design, we still had the domain controller logs. A different account collects the DC logs, and those were still humming along. So, off to the log server we went, and we searched on that user ID and the number 4740 to see what we could find.
What we found was my ghost–an IP address I didn’t recognize, using something written in Java (the telltale signature was JCIFS and a number), trying to collect logs using that user account.
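For anyone who wants to run the same kind of search against exported logs, here is a minimal sketch in Python. Event ID 4740 is the Windows security event for a locked-out account. Everything else is an assumption for illustration: the record layout (event_id, target_user, caller), the account name svc_logs, and the sample data are all made up, and your log server's export will use its own field names.

```python
from collections import Counter

LOCKOUT_EVENT_ID = 4740  # Windows Security event: "A user account was locked out"


def lockout_sources(events, account):
    """Count lockout events for `account`, grouped by the machine named as the caller.

    `events` is an iterable of dicts using the hypothetical keys
    "event_id", "target_user", and "caller".
    """
    wanted = account.lower()
    counts = Counter()
    for event in events:
        if event.get("event_id") != LOCKOUT_EVENT_ID:
            continue  # not a lockout event
        if event.get("target_user", "").lower() != wanted:
            continue  # a lockout, but for some other account
        counts[event.get("caller", "unknown")] += 1
    # Most frequent caller first
    return counts.most_common()


# Made-up sample records, not real log data
sample_events = [
    {"event_id": 4740, "target_user": "svc_logs", "caller": "mystery-host"},
    {"event_id": 4624, "target_user": "svc_logs", "caller": "collector-1"},
    {"event_id": 4740, "target_user": "SVC_LOGS", "caller": "mystery-host"},
    {"event_id": 4740, "target_user": "someone_else", "caller": "desktop-7"},
    {"event_id": 4740, "target_user": "svc_logs", "caller": "collector-1"},
]

for caller, count in lockout_sources(sample_events, "svc_logs"):
    print(f"{caller}: {count} lockout(s)")
```

Grouping by caller is the point: a name or address you don't recognize stands out immediately when it sits at the top of a short list instead of being scattered through thousands of lines.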
So I did an nslookup on that IP address. It was our retired log product. But that couldn’t be. Those servers were dead. I’ve seen the bodies. They’re sitting unplugged in a closet, and we pull them out every once in a while when we need a throwaway server to write an image to.
So I sent an IM to a coworker who, unlike me, has been with the company longer than six months. He recognized the name. He thought that server was dead too, though he was less convinced than I was. Connecting to it with a web browser, we got a bit of an answer in the form of a nonfunctioning product, but its telltale logo was right there in the upper left corner. We weren’t dealing with a ghost. We were dealing with a zombie.
The server didn’t respond over SSH. I didn’t ask if he tried Telnet. But he was able to connect to it via the management interface and shut it down.
And with that, the account lockouts ended.
This isn’t the first zombie server I’ve dealt with. In the first case, someone deliberately plugged a decommissioned server back in, turned it on, and continued using it. I’m sure in this case it was an accident, like someone bumping a power switch in the data center.
I learned some good lessons through the ordeal though.
Decommissioned servers have ways of finding themselves back in service. I saw it happen 15 years ago in a completely insecure environment, and now I’ve seen it happen in a highly secured datacenter. When you decommission a server, make sure it can’t come back. If you can’t unrack the server and haul it away immediately, then at the very least uncable it and take the cables with you when you power it down for what’s supposed to be the last time.
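Pulling cables is the real fix, but a periodic check can catch a zombie before it locks an account. The following is a rough sketch in Python, not something we actually ran: it tries a TCP connection to each host on a list of supposedly dead servers and reports any that answer. The host name and the choice of ports (22 for SSH, 443 for a web interface) are placeholders I made up for illustration.

```python
import socket


def still_answering(hosts, ports=(22, 443), timeout=2.0):
    """Return (host, port) pairs that accept a TCP connection.

    Anything returned here is a "decommissioned" server that is
    powered on and reachable on the network.
    """
    alive = []
    for host in hosts:
        for port in ports:
            try:
                # create_connection resolves the name and connects; the
                # context manager closes the socket again right away.
                with socket.create_connection((host, port), timeout=timeout):
                    alive.append((host, port))
            except OSError:
                # Refused, timed out, or name no longer resolves: treat
                # the host/port as not answering.
                pass
    return alive


if __name__ == "__main__":
    # Hypothetical list of retired servers; replace with your own.
    retired = ["old-logger.example.com"]
    for host, port in still_answering(retired):
        print(f"WARNING: {host} is answering on port {port}")
```

A silent result doesn't prove a server is off, since a firewall could be dropping the traffic, so treat this as a tripwire rather than proof.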
Segment your logging accounts. Our domain controllers used a different account for log collection. That wasn’t something we saw as a good thing, but sometimes brilliant design happens unintentionally. Since we had multiple accounts in use–three, in fact–we still had access to some very useful logs even with one of the accounts locked. This was always supposed to be a temporary arrangement, but I’m going to push for it to be permanent.
Turn down the noise. I think the process that was truly locking the account was there in the logs all along, but it was buried from view. By turning the failing product completely off and leaving it off for a while, I could see there really was something else using the account. This time I actually saw it, instead of an endless stream of events from my own system trying to use a locked account, each one logged as an incorrect password rather than as a lockout. I had to cast a wider net to find it, searching 60 minutes of logs rather than five, but with the noise eliminated, I found it rather quickly.
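If you can't shut the noisy system off, you can get part of the same effect by filtering it out of the search. Here is a small Python sketch of that idea, under stated assumptions: the record keys (time, account, source_ip), the account name, and the addresses are all invented for the example, and the addresses come from the 192.0.2.0/24 range reserved for documentation.

```python
from datetime import datetime, timedelta


def unexplained_events(events, account, known_sources, end, window_minutes=60):
    """Return events for `account` inside the time window that did NOT
    come from a source we already know about (e.g. our own collectors)."""
    start = end - timedelta(minutes=window_minutes)
    wanted = account.lower()
    return [
        e for e in events
        if e["account"].lower() == wanted
        and e["source_ip"] not in known_sources
        and start <= e["time"] <= end
    ]


# Made-up sample data, not real log entries
now = datetime(2020, 1, 1, 12, 0)
sample_events = [
    {"time": now - timedelta(minutes=2), "account": "svc_logs", "source_ip": "192.0.2.10"},
    {"time": now - timedelta(minutes=3), "account": "svc_logs", "source_ip": "192.0.2.10"},
    {"time": now - timedelta(minutes=45), "account": "svc_logs", "source_ip": "192.0.2.99"},
    {"time": now - timedelta(minutes=90), "account": "svc_logs", "source_ip": "192.0.2.77"},
]
collectors = {"192.0.2.10"}  # hypothetical addresses of our own collectors

leftovers = unexplained_events(sample_events, "svc_logs", collectors, end=now)
for e in leftovers:
    print(e["time"], e["source_ip"])
```

With the default 60-minute window, the only event left is the one from the address nobody recognizes. Shrink the window to five minutes and the result is empty, which is the trap I kept falling into.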