Troubleshooting intermittent PC problems

How to troubleshoot an intermittent PC problem. We’ve got an aging P2-233 at work that likes to bluescreen a lot under NT4–usually once every day or two. No one who looked at it was able to track it down. The first thing I noticed was that it still had the factory installation of NT, from about three years ago. Factory installations are bad news. The first thing you should do with any PC is install a fresh copy of Windows. If all you have are CAB files and no CD, don’t format the drive–just boot to DOS, go into the directory that holds the CAB files, run Setup, and install to a new directory other than C:\Windows. With NT, it’s also possible to install from DOS, though the syntax escapes me momentarily.

My first suggestion was to run RAM Stress Test over the course of a weekend to eliminate the possibility of bad memory. I followed that by formatting the drive as FAT and running SpinRite. After six hours, SpinRite gave the disk a completely clean bill of health.

Knowing the memory and disk were good, I built up the system, installing NT, then installing SP5 128-bit, then installing IE 5.01SP1, then installing Diskeeper Lite, then installing Office 97 and Outlook 98 and WRQ Reflection, then running Windows Update to get all the critical updates and SP6a. (Yes, the great hater of Windows works in a shop that runs Microsoft software almost exclusively on its PCs.) I ran Diskeeper after each installation to keep the drive in pristine condition–I find I get better results that way than by installing everything and then running Diskeeper.

The system seemed pretty stable through all that. Then I went to configure networking and got a bluescreen. Cute. I rebooted and all was well and remained well for an hour or two.

How to see if the bluescreen was a fluke?

I devised the following batch file:

  :loop
  dir /w /s c:\
  goto loop

Who says command lines are useless and archaic? Definitely not me! I saved the file as stress.bat and ran 10 instances of it. Then I hit Ctrl-Alt-Del to bring up Task Manager. CPU usage was at 100%. Good.

The system bluescreened after a couple of hours.

How to track down the problem? Well, I knew the CD-ROM drive was bad. Can a bad CD-ROM cause massive system crashes? I’ve never heard of that, but I won’t write off anything. So I disconnected the CD-ROM drive. I’d already removed all unnecessary software from the equation, and I hadn’t installed any extraneous peripherals either. So with the CD-ROM drive eliminated, I ran 10 instances of the batch file again.

The system didn’t make it through the night.

OK. Memory’s good. Hard drive’s good. Bad CD-ROM drive out of the equation. Fresh installation of OS with nothing extra. What next?

I called my boss. I figured maybe he’d have an idea, and if not, he and I would contact Micron to see what they had to suggest–three-year warranties and a helpful technical support staff from a manufacturer who understands the needs of a business client are most definitely a good thing. The day Apple manages to figure that out will be the first step towards capturing and keeping more than six percent of the market. But I digress.

My boss caught the obvious possibility I missed: heat.

All the fans worked fine, and the CPU had a big heatsink put on at the factory that isn’t going anywhere. Hopefully there was thermal compound in there, but if there wasn’t, I wouldn’t be getting in there to put any in, nor would I be replacing the heatsink with a heatsink/fan combo. So I pulled the P2-333 out of the PC I use–it was the only 66 MHz-bus P2 I had–and put it in the system. I’d forgotten those old P2s weren’t multiplier-locked, so the 333 ended up running at 233. That’s fine. I’ve never had overheating problems with that chip at its rated speed, so at 100 MHz less, I almost certainly wouldn’t run into problems.

With that CPU, the system happily ran 10 instances of my batch file for 30 hours straight without a hiccup. So I had my culprit: That P2-233 was overheating.

Now, ideally a stress test would tax more system memory than this one did and would force some floating-point operations as well. So for your home system, a good stress test might be to load up several FPS games and let them run in demo mode continuously for a while. A command-line MP3 encoder, encoding the same WAV file and then deleting the resulting MP3 file over and over in a continuous-loop batch file also would suffice to put the floating-point unit to use and would also force the disk into action.
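If you'd rather script that kind of load than rig up games or an encoder, here is a minimal sketch in Python. It is not what I ran on the machine above. The names `stress_worker` and `run_stress` are made up for illustration, the 10 workers simply mirror the 10 batch-file instances, and the memory size and loop counts are arbitrary guesses you would tune for your own system.

```python
import math
import multiprocessing
import time


def stress_worker(seconds, megabytes):
    """Touch a block of memory and do floating-point math until time runs out.

    Returns the number of complete passes. Hypothetical helper, not a
    standard tool.
    """
    block = bytearray(megabytes * 1024 * 1024)
    deadline = time.monotonic() + seconds
    passes = 0
    while time.monotonic() < deadline:
        # Write a pattern into every 4 KB step of the block, then read it back.
        pattern = passes % 256
        for offset in range(0, len(block), 4096):
            block[offset] = pattern
        for offset in range(0, len(block), 4096):
            if block[offset] != pattern:
                raise RuntimeError("memory mismatch at offset %d" % offset)
        # Keep the floating-point unit busy for a while.
        total = 0.0
        for i in range(1, 50000):
            total += math.sqrt(i) * math.sin(i)
        passes += 1
    return passes


def run_stress(workers=10, seconds=60, megabytes=16):
    """Run several workers in parallel and return each worker's pass count."""
    with multiprocessing.Pool(workers) as pool:
        return pool.starmap(stress_worker, [(seconds, megabytes)] * workers)
```

Something like `run_stress(workers=10, seconds=3600)` would keep the CPU and a modest chunk of memory busy for an hour. On Windows, make that call from under an `if __name__ == "__main__":` guard so the worker processes start cleanly. Unlike the `dir` loop or the MP3 encoder, this sketch doesn't touch the disk, so it complements those rather than replacing them.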

If you have time and parts available, you can troubleshoot a recalcitrant PC by running such a real-world stress test, then replacing possible suspect parts (CPU, memory, hard drive, motherboard) one at a time until you isolate the problem.
