Using your logs to help track down spammers and trolls

It seems like lately we’ve been talking more on this site about trolls and spam and other troublemakers than about anything else. I might as well document how I went about tracking down two recent incidents to see if they were related.
WordPress and b2 store the IP address the comment came from, as well as the comment and other information. The fastest way to get the IP address, assuming you haven’t already deleted the offensive comment(s), is to go straight to your SQL database.

mysql -p
[enter the root password]
use b2database;
select * from b2comments where comment_post_id = 819;

Substitute the number of your post for 819, of course. The poster’s IP address is the sixth field.
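
If you’d rather not count fields, ask for the column by name. If I remember right, b2 calls it comment_author_IP; run describe b2comments; first if your copy differs:

select comment_author_IP, comment_date from b2comments where comment_post_id = 819;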

If your blogging software records little other than the date and time of the message, you’ll have to rely on your Apache logs. On my server, the logs are at /var/log/apache, stored in files with names like access.log, access.log.1, and access.log.2.gz. They are archived weekly, with anything older than two weeks compressed using gzip.

All of b2’s comments are posted using a file called b2comments.post.php. So one command can turn up all the comments posted on my blog in the past week:

cat /var/log/apache/access.log | grep b2comments.post.php

You can narrow it down by piping it through grep a bit more. For instance, I knew the offending comment was posted on 10 November at 7:38 pm.

cat /var/log/apache/access.log | grep b2comments.post.php | grep 10/Nov/2003

Here’s one of my recent troublemakers:

24.26.166.154 - - [10/Nov/2003:19:38:28 -0600] "POST /b2comments.post.php HTTP/1.1" 302 5 "https://dfarq.homeip.net/index.php?p=819&c=1" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031007 Firebird/0.7"

This line reveals quite a bit: besides his IP address, it also tells us his operating system and web browser.

Armed with his IP address, you can hunt around and see what else your troublemaker’s been up to.

cat /var/log/apache/access.log | grep 24.26.166.154
zcat /var/log/apache/access.log.2.gz | grep 24.26.166.154

The earliest entry you can find for a particular IP address will tell you where the person came from, since Apache logs the referrer. In one recent case, the person started off with an MSN search looking for information about an exotic airplane. In another, it was a Google search looking for the words “Microsoft Works low memory.”
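
If you want to automate the hunt for that earliest entry, something like this digs through the current log and the archives, oldest first, and prints the first hit (zcat -f reads compressed and uncompressed files alike):

zcat -f /var/log/apache/access.log.2.gz /var/log/apache/access.log.1 /var/log/apache/access.log | grep 24.26.166.154 | head -1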

You can infer a few things from where a user originally came from and the operating system and web browser the person is using. Someone running the most recent Mozilla Firebird on Linux and searching with Google is likely a more sophisticated computer user than someone running a common version of Windows with the copy of IE it shipped with and searching with MSN.

You can find out other things about individual IP addresses, aside from the clues in your logs. Visit ARIN to find out who owns the IP address. Most ARIN records include contact information, if you need to file a complaint.
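
If your system has a whois client, you don’t even have to leave the terminal; for an IP address, most whois clients will query ARIN for you:

whois 24.26.166.154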

Visit Geobytes.com IP Locator to map the IP address to a geographic region. I used the IP locator to determine that the guy looking for the airplane was in Brooklyn, and the Microsoft guy was in Minneapolis.

Also according to my Apache logs, the guy in Brooklyn was running IE 6 on Windows XP. The guy in Minneapolis was running Mozilla Firebird 0.7 on Linux. (Ironic, considering he was looking for Microsoft information.) It won’t hold up in a court of law, but the geographic distance and differing usage habits give at least some indication it’s two different people.

Yes, I’m still alive

I had to take some time away to clear my head and find myself. It’s a survival tactic; the guy other people wanted Dave to be hasn’t been getting the job done.
Besides, anyone who’s worth anything will like the real Dave better than Dave the Chameleon anyway. Those who like Dave the Chameleon better can go find themselves someone else to be a chameleon. There doesn’t seem to be any shortage of people who are willing. But I think it’s rude to ask someone to change before you really get to know him or her, don’t you?

So I’ve been ignoring the site partly because when I’m paying attention to it, it’s really tempting to try to figure out what to write to make myself popular. And partly because it’s a distraction when I’m trying to figure out who I am. Writing is a big part of me, but it’s only part of me.

So I dug out some things I enjoyed in the past. I’ve been reading F. Scott Fitzgerald and listening to Peter Gabriel and U2 (early stuff, long before they got popular) and Tori Amos and Echo and the Bunnymen. The way I used to do things was to go look for stuff that most people overlooked, rather than letting current trends tell me what to like. So none of that’s cool anymore. Big deal.

The majority isn’t always right. Exhibit A: Disco.

I remember when I was in high school, either my freshman or sophomore year, a popular girl a year older than me came up to me and told me I needed to be more of a rebel. I thought about that and came to the conclusion that I was a rebel. She and her crowd were rebelling against authority figures. I was rebelling against conformity.

Oddly enough, I ended up sitting next to her boyfriend in Spanish class not long after that. We couldn’t stand each other at first, but then it turned out we had a lot more common ground than either one of us could have imagined and we became friends.

I can’t help but think of Fitzgerald. Fitzgerald was the spokesman of his generation, a generation not at all unlike ours, a generation that lived to excess and partied harder than any generation before it, and, until Gen X came along, any since. It’s obvious from Fitzgerald’s writing that the excess fascinated him, but there was also a lot about it he didn’t like. Yet his lifestyle didn’t change much. The result? The Voice of the Twenties was dead, aged 44, in 1940. Although some of his contemporaries recognized his greatness then, he was mostly remembered as a troublesome drunk.

Would Fitzgerald have lived longer if he’d been more of a rebel of a different sort? Well, I’d like to think so.

I’ve also been playing with computers. I pressed my dual Celeron back into duty and upgraded to the current version of Debian Unstable (I last did that sometime last summer, I think). It’s much, much faster now. I suspect it’s due to the use of GCC 3.2 or 3.3 instead of the old standby GCC 2.95. But I’m not sure. What I do know is the machine was really starting to feel sluggish, and now it feels fast again, almost like it felt to me when I first got it.
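
For the record, the upgrade itself is mostly a matter of pointing /etc/apt/sources.list at unstable and letting apt sort out the rest:

apt-get update
apt-get dist-upgrade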

I’ve also been playing with PHP accelerators. I know I can only speed up a DSL-hosted site by so much, but my server serves up static pages much faster than my PHP pages, and I want some of that speed back.

I’ve played around with WordPress a little bit more. It appears the new version will allow me to publish an IP address along with comments. I like that. I’m sick of rude people slinging mud from behind a wall of anonymity. I’m sure they’re much smarter than I am. So they ought to set up their own Web sites, so they can say whatever they want and enlighten the masses. If, as my most recent accuser says, what God wants is for Dave Farquhar and people like him to shut up, it won’t take much to drown my voice out.

OK, I’m done ranting. I’m gonna go in to work tomorrow and be my own person. I’m going to do what’s right, and not what’s popular, even when doing what’s right makes me unpopular. I’m going to stay focused and driven. The possibilities ahead are more important than the mistakes of the past and whatever happens to be missing from the present.

And there’ll be less missing with my vacationing coworkers back in the office.

And everything that’s true about work is true about life at home as well. Speaking of which, when I was out this weekend I noticed I was drawing second looks from girls again. Eating healthy again must be helping. That can’t be bad.

Well, this has to be the most disorganized and unfocused thing I’ve written in years. But I need to post something.

I’ll be back when my head’s more clear.

Optimizing a web server

Promises of better Apache performance have me lusting after lingerd, a very obscure utility that increases performance for dynamic content. It’s been used on a handful of little sites you might have heard of: Slashdot, Newsforge, and LiveJournal.
Unfortunately there’s no Debian package, which means compiling it myself, which means compiling Apache myself, which also means compiling PHP and MySQL, which means a big ol’ pain, but potentially better performance since I could go crazy on the GCC optimization flags. Hello, -O3 -march=i686!
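
The build would go something like this. Treat it as a sketch: the version numbers are whatever happens to be current, lingerd’s own hook into the Apache source (per its documentation) comes before the Apache build, and the configure options may need tweaking for your setup.

# optimization flags for everything that follows
export CFLAGS="-O3 -march=i686"

# Apache, with mod_so enabled so PHP can load as a module
cd apache_1.3.29
./configure --prefix=/usr/local/apache --enable-module=so
make && make install

# PHP as an Apache module, with MySQL support
cd ../php-4.3.4
./configure --with-apxs=/usr/local/apache/bin/apxs --with-mysql
make && make install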

And if I’m going to compile all that myself, I figure I might as well go all the way, get the higher performance across the board, and get GCC 3.2.x into the picture for even better performance. The easy way to do that is with lfs-install, which builds a system based on Linux From Scratch. For workstations I’d rather use something along the lines of Gentoo, but for servers, LFS is small, mature, and reasonably conservative.

Supposedly metalog offers improved performance over the more traditional syslogd or sysklogd. The good news is that those who are saner than I am and are sticking with Debian for everything can take advantage of a Debian package (at least in unstable) and just apt-get away.
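
On a Debian unstable box, that amounts to a one-liner:

apt-get install metalog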

If I have any sanity left, I’ll think about minit to replace System V init and save about 400K of memory in a process that’s always running, and fgetty to save a little more. I’ve tried fgetty in the past without success; it turns out fgetty requires DJB’s checkpassword in order to work.

Keep in mind I haven’t tried any of this yet. But the plan sounds so good in my current sleep-deprived state I couldn’t help but share it.

More changes on the way

I’m playing with b2 0.61, which comes very close to feature parity with Movable Type. The only MT feature I’m jealous of that this newest version of b2 still lacks is multiple categories per post, which is something. I remember a Ted Williams tribute I wrote after he died that someone told me belonged in “human interest” rather than “baseball”–multiple categories would fix that problem neatly.
But pingbacks and trackbacks are there, so I can interact with other blogs and they can interact with me, which is good.

On another front, I’ve managed to pull down all of my old content from editthispage.com (October 2000-April 2001), which Steve DeLassus and I are trying to massage into a form suitable for importing here. I did it in a very crude fashion–I set my display preferences to display 365 entries on the front page, then I downloaded my page with wget and I manually stripped out the obvious cruft. Very inelegant but it mostly works.
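
The wget step really was that crude–a single command against the front page (hostname changed here for illustration):

wget http://mysite.editthispage.com/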

As for the custom b2 code Steve wrote, I’m getting closer to getting it into a form that’s distributable. The PHP calendar he modified for my use is history, replaced with one by Alex King that integrates more nicely with the rest of b2 and follows Mark Pilgrim’s accessibility guidelines nicely.

A reminder: 30 Days to a More Accessible Web Site

In a conversation today, I referred to Mark Pilgrim’s excellent 30 Days to a More Accessible Web Site.
This is must-read material. I confess to neglecting most of the things in this piece, even though I would have gained substantial benefit from some of it at a recent point in my life, when I wasn’t able to operate a mouse and could barely keyboard.

I implemented the “add titles to links” feature. It required me to hack some PHP and is certainly the most substantial thing I’ve implemented without Steve’s help. It’s not much but it’s nice, even for those who have no disabilities–now, when you mouse over a calendar entry, the title of the entry pops up, like a tooltip. And for those using speech readers, now my calendar starts to make some sense.
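
The change boils down to emitting a title attribute on each day’s link in the calendar. Here’s a rough sketch of the idea–the variable names are made up for illustration, not b2’s actual code:

<?php
// Sketch: emit a calendar day link with the post title riding along
// in the title attribute, so it pops up as a tooltip and gets read
// aloud by speech readers. Sample values stand in for b2's data.
$post_id = 819;
$day = 10;
$post_title = 'Using your logs to help track down spammers and trolls';
printf('<a href="index.php?p=%d" title="%s">%d</a>',
    $post_id, htmlspecialchars($post_title), $day);
?>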

As a bonus, some of this stuff will make Google treat you better if you implement it.

Read it. Download a copy and save it to your hard drive. And start implementing it.

Roll your own news aggregator in PHP

M.Kelley: I’m also wondering how hard would it be to pull a PHP/MySQL (or .Net like BH uses) tool to scrape the syndicated feeds off of websites and put together a dynamic, constantly updated website.
It’s almost trivial. So simple that I hesitate to even call it “programming.” And there’s no need for MySQL at all–it can be done with a tiny bit of PHP. Since it’s so simple, and potentially so useful, it’s a great first project in PHP.

It’s also terribly addictive–I quickly found myself assembling my favorite news sources and creating my own online newspaper. To a former newspaper editor (hey, they were student papers, but one of them was at Mizzou, and in my book, if you can be sued for libel and anyone will care, it counts), it’s great fun.

All you need is a little web space and a writable directory. If you administer your own Linux webserver, you’re golden. If you have a shell account on a Unix system somewhere, you’re golden.

First, grab ShowRDF.php by Ian Monroe, a simple GPL-licensed PHP script that does all the work of grabbing and decoding an RDF or RSS file. There are tons of tutorials online that tell you how to code your own solution to do this, but I like this one because you can pass options to it to limit the number of entries, and the length of time to cache the feed. Many RDF decoders fetch the file every time you call them, and some feeds impose a once-an-hour limit and yell at you (or just flat ban you) if you go over. Using existing code is a good way to get started; you can write your own decoder that works the way you want at some later date.

ShowRDF includes a PHP function called InsertRDF that uses the following syntax:
InsertRDF("feed URL", "name of file to cache to", TRUE, number of entries to show, number of seconds to cache feed);

Given that, here’s a simple PHP page that grabs my newsfeed:


<html><body>

<?php include("showrdf.php"); ?>

<?php

// Gimme 5 entries and update once an hour (3600 seconds)

InsertRDF("https://dfarq.homeip.net/b2rss.xml", "~/farquhar.cache", TRUE, 5, 3600);

?>

</body></html>

And that’s literally all there is to it. That’ll give you a very simple HTML page with a bulleted list of my five most recent entries. Unfortunately it gives you the entries in their entirety, but that’s b2’s fault, and my fault for not modifying it. I’ll be doing that soon.

You can see the script in action by copying and pasting it into your Web server. It’s not very impressive, but it wasn’t any effort either.

You can pretty it up by making yourself a nice table, or you can grab a nice CSS layout from glish.com.

I can actually code tables without stealing even more code, so here’s an example of a fluid three-column layout using tables that’ll make a CSS advocate’s skin crawl. But this’ll get you started, even if that’s the only useful purpose it serves.


<html><body>

<?php include("showrdf.php"); ?>

<table width="99%" border="0" cellpadding="6">

<tr>

<td colspan="3" align="left">
<h1>My personal newspaper</h1>
</td>

</tr>

<tr>

<td width="25%">

<!-- This is the leftmost column's contents -->

<!-- Hey, how about a navigation bar? -->

<?php include("navigationbar.html"); ?>

</td>

<!-- Middle column -->

<td width="50%">

<h1>Dave Farquhar</h1>

<?php

// Gimme 5 entries and update once an hour (3600 seconds)

InsertRDF("https://dfarq.homeip.net/b2rss.xml", "~/farquhar.cache", TRUE, 5, 3600);

?>

</td>

<!-- Right sidebar column -->

<td width="25%">

<h2>Freshmeat</h2>

<?php

InsertRDF("http://www.freshmeat.net/backend/fm-releases-software.rdf", "~/fm.cache", TRUE, 10, 3600);

?>

<h2>Slashdot</h2>

<?php

InsertRDF("http://slashdot.org/developers.rdf", "~/slash.cache", TRUE, 10, 3600);

?>

</td>

</tr>

</table>

</body></html>

Pretty it up to suit your tastes by adding color elements to the <td> tags and using font tags. Better yet, use the knowledge you just gained to sprinkle PHP statements into a pleasing CSS layout you find somewhere.

Finding newsfeeds is easy. You can find everything you ever wanted and then some at Newsisfree.com.

Using something like this, you can create multiple pages, just like a newspaper, and put links to each of your files in a file called navigationbar.html. Every time you create a new page containing a set of feeds, link to it in navigationbar.html, and all of your other pages will reflect the change. This shows off another of PHP’s niceties–maintaining things like navigation bars is one of the worst chores of static HTML pages, and PHP makes it very convenient.
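
The navigation bar itself can be as simple as a list of links. These filenames are made up, but this is the whole idea:

<!-- navigationbar.html: one link per page of feeds -->
<p>
<a href="index.php">Front page</a><br>
<a href="tech.php">Tech news</a><br>
<a href="sports.php">Sports</a>
</p>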

A b2 user looks longingly at Movable Type

This web site is in crisis mode.
I’ve been talking the past few days with a lot of people about blogging systems. I’ve worked with a lot of them. Since 1999, I’ve gone from static pages to Manila to Greymatter to b2, and now I’m thinking about another move, this time to Movable Type.

At the time I made each move, each of the solutions I chose made sense.

I really liked Manila’s calendar and I really liked having something take care of the content management for me. I moved to Greymatter from Manila after editthispage.com had one too many service outages. (I didn’t like its slow speed either. But for what I was paying for it, I couldn’t exactly complain.) Greymatter did everything Manila would do for me, and it usually did it faster and better.

Greymatter was abandoned right around the time I started using it. But at the time it was the market leader, as far as blogs you ran on your own servers went. I kept on using it for a good while because it was certainly good enough for what I wanted to do, and because it was super-easy to set up. I was too chicken to try anything that required PHP and MySQL, because at the time, setting up Apache, PHP and MySQL wasn’t exactly child’s play. (It’s still not quite child’s play, but it’s a whole lot easier now than it used to be.)

Greymatter remained good enough until one of my posts here got a hundred or so responses. Posting comments to that post became unbearably slow.

So I switched to b2. Fundamentally, b2 was pretty good. Since it wasn’t serving up static pages it wasn’t as fast as Greymatter, but when it came to handling comments, it processed the 219th comment just as quickly as it processed the first. And having a database backend opened up all sorts of new possibilities, like the Top 10 lists on the sidebar (courtesy of Steve DeLassus). And b2 had all the basics right (and still does).

When I switched to b2, a handful of people were using a new package called Movable Type. But b2 had the ability to import a Greymatter site. And Movable Type was written in Perl, like Greymatter, and didn’t appear to use a database backend, so it didn’t look like a solution to my problem.

Today, Movable Type does use a MySQL backend. And Movable Type can do all sorts of cool stuff, like pingbacks, and referrer autolinks. Those are cool. If someone writes about something I write and they link to it, as soon as someone follows the link, the link appears at the bottom of my entry. Sure, comments accomplish much the same thing, but this builds community and it gives prolific blogs lots of Googlejuice.

And there’s a six-part series that tells how to use Movable Type to implement absolutely every good idea I’ve ever had about a Weblog but usually couldn’t figure out how to do. There are also some ideas there I never conceived of.

In some cases, b2 just doesn’t have the functionality. In some cases (like the linkbacks), it’s so easy to add to b2 even I can do it. In other cases, like assigning multiple categories to a post, it’s difficult. I don’t doubt b2 will eventually get most of this functionality. But when someone else has the momentum, what to do? Do I want to forever be playing catch-up?

And that’s my struggle. Changing tools is always at least a little bit painful, because links and bookmarks go dead. So I do it only when it’s overwhelmingly worthwhile.

Movable Type will allow you to put links to related entries automatically. Movable Type will help you build meaningful metatags so search engines know what to do with you (MSN had no idea what to do with me for the longest time–I re-coded my page design a couple of weeks ago just to accommodate them). MT will allow you to tell it how much to put into your RSS feed (which I’m sure will draw cheers from the poor folks who are currently pulling down the entire story all the time).

MT doesn’t have karma voting, which Greymatter had (and which I had Steve add to b2). I like it but I can live without it. I can probably get the same functionality from page reads. Or I can just code up a “best of” page by hand, using page reads, feedback, and gut feeling as my criteria.

The skinny: I’m torn on whether I should migrate. I stand to gain an awful lot. The main reason I have to stay with what I have is Steve’s custom code, which he worked awfully hard to produce, and some of it gives functionality that MT doesn’t currently have. Then again, for all I know it might not be all that hard to adapt his code to work with MT.

I know Charlie thought long and hard about switching. He’s one of the people I’ve been talking with. And I suspected he would be the first to switch. The biggest surprise to me when he did was that it took him until past 3 p.m. today to do it.

And I can tell you this. If I were starting from scratch, I’d use Movable Type. I doubt I’d even look at anything else.

Increase the speed of your Web pages

There are commercial utilities that will optimize your HTML and your images, cutting the size down so your stuff loads faster and you save bandwidth. But I like free.
I found free.

Back in the day, I told you about two programs, one for Windows and one for Unix, that will crunch down your JPEGs by eliminating metadata that’s useless to Web browsers. The Unix program will also optimize the Huffman tables and optionally resample the JPEG into a lossier image, which can net you tremendous savings but might also lower image quality unacceptably.

Yesterday I stumbled across a program on Freshmeat called htmlcrunch that strips extraneous whitespace from HTML and XML files. Optionally, it will also remove comments. The program works in DOS–including under a command prompt in Windows 9x/NT/2000/XP, and it knows how to handle long filenames–or Unix.

It’s not advertised as such, but I suspect it ought to also work on PHP and ASP files.

How much it will save you depends on your coding style, of course. If you tend to put each tag on one line with lots of pretty indentation like they teach in computer science classes, it will probably save you a ton. If you code HTML like me, it’ll save you somewhat less. If you use a WYSIWYG editor, it’ll probably save you a fair bit.

It works well in conjunction with other tools. If you use a WYSIWYG editor, I suggest you run the code through HTML Tidy first. HTML Tidy, unlike htmlcrunch, actually interprets the HTML and removes some troublesome information. In some cases HTML Tidy will add characters, but this is usually a good thing–its changes improve browser compatibility. And if you feed HTML Tidy a bunch of broken HTML, it’ll fix it for you.
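
If you have the command-line version of HTML Tidy installed, cleaning a file in place is a single command:

tidy -m index.html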

You can further optimize your HTML with the help of a pair of Unix commands. But you run Windows? No sweat. You can grab native Windows command-line versions of a whole slew of Unix tools in one big Zip file here.

I’ve found that these HTML tools sometimes leave spaces and linebreaks between HTML elements under some circumstances. Whether this is intentional or a bug in the code, who knows. But it’s easy to fix with the Unix tr command, which can strip the linebreaks out entirely:

tr -d "\n\r" < index.html > indexopt.html

Some people believe that Web browsers parse 255-character lines faster than any other line length. I’ve never seen this demonstrated. And in my experience, any Web browser parses straight-up HTML plenty fast no matter what, unless you’re running a seriously, seriously underpowered machine, in which case optimizing the HTML isn’t going to make a whole lot of difference. Also in my experience, every browser I’ve looked at parses CSS entirely too slowly. It takes most browsers longer to render this page than it takes for my server to send it over my pokey DSL line. I’ve tried mashing my stylesheets down, and multiple 255-character lines versus no linebreaks whatsoever made little, if any, difference.

But if you want to try it yourself, pass your now-optimized HTML file(s) through the standard Unix fmt command, like so:

fmt -w 255 index.html > index255.html

Optimizing your HTML files to the extreme will take a little time, but it’s probably something you only have to do once, and your page visitors will thank you for it.

Well, that was unproductive

I didn’t do much yesterday but lie around and read ancient Dave Barry columns.
Well, I fixed a longstanding problem with Debby’s site (debugging PHP code is such a pain when the curly braces don’t all line up) and I returned a batch of car fuses I bought a couple of weeks ago that it turned out I didn’t need. Six bucks is six bucks, you know. That’s lunch for two days if I’m careful.

Other than that, I took a nap. Actually I might have dozed off twice. I don’t remember. It was a tough week. On Monday I got paged at 11:30 at night with a tape backup problem and ended up having to go in to work to fix it. Tuesday was quiet. Wednesday and Thursday I got paged late. How late, I don’t remember. But late enough that I’d gone to sleep. Wednesday’s problem I fixed remotely. Thursday’s problem might or might not have been fixable remotely, but the operator kept talking about blinking lights on the tape drive (it’s an internal drive, and I’m convinced he was referring to the blinking lights on the hard drives), and since multiple blinking lights on a tape drive usually indicate big trouble, I ended up going in. I stumbled through the problem and finally went home. It wasn’t a hardware problem.

On Friday one of my coworkers took a digital picture of a tape drive for me so I can ask pointed questions about the blinky lights when the problem comes up again. Looking at the picture, now I remember: blinky lights on the left mean big trouble. Blinky lights on the right are highly unlikely, so I guess that’s even bigger trouble.

On my way home from Promise Keepers on Friday, I told my buddies I fully expected to get paged that night about a tape backup problem. They all thought that was pretty awful. I got home, plugged my work laptop in and booted it up, intending to be pre-emptive. I didn’t want to get paged at 1 in the morning and get told a 9:00 backup job failed–not when I needed to be at church at 7 in the morning with an 11-hour day (minimum) ahead of me. As I was firing up pcAnywhere, my phone rang. It was one of the operators, and a backup job had failed. I went in and fixed it.

But, seeing as I didn’t sleep more than six hours uninterrupted any night this week and operated two nights on four hours (I know, when parenthood comes I’m in trouble, but I really work best on 7 hours during the week and 8 on weekends), I slept in yesterday. I was late for church. The 10:45 service. Pathetic, I know. Then there was a post-service meeting I’d forgotten about. Oops. So I was there for two hours. They mercifully cut it off at two hours. I was out of fuel and was getting irate at the weird questions some people were starting to ask.

Then I came home, got irritated that my SWBell e-mail still isn’t working (six days and counting–makes me wonder if they’ve ditched their Sun equipment for Windows NT), tried to remember how to set up my own mail server again, decided it was too much thinking, and took a nap instead. That set the pace for the day.

I’m hoping this week isn’t a repeat performance of last week.

So you think Linux is unproven?

I’ve had arguments at work with one of the managers as to whether Linux is up to the task of running an enterprise-class Web server. When I mention my record with Linux running this site, the manager dismisses it, never mind that this site gets more traffic than a lot of the sites we run at work. So I went looking this afternoon for some sites that run on Linux, Apache, and PHP, like this one does.
I found a bunch of small-timers.