There are commercial utilities that will optimize your HTML and your images, cutting the size down so your stuff loads faster and you save bandwidth. But I like free.
I found free.
Back in the day, I told you about two programs, one for Windows and one for Unix, that will crunch down your JPEGs by eliminating metadata that’s useless to Web browsers. The Unix program will also optimize the Huffman tables and optionally resample the JPEG into a lossier image, which can net you tremendous savings but might also lower image quality unacceptably.
Yesterday I stumbled across a program on Freshmeat called htmlcrunch that strips out extraneous whitespace from HTML and XML files. Optionally, it will also remove comments. The program works in DOS (including under a command prompt in Windows 9x/NT/2000/XP, and it knows how to handle long filenames) or Unix.
It’s not advertised as such, but I suspect it ought to also work on PHP and ASP files.
How much it will save you depends on your coding style, of course. If you tend to put each tag on one line with lots of pretty indentation like they teach in computer science classes, it will probably save you a ton. If you code HTML like me, it’ll save you somewhat less. If you use a WYSIWYG editor, it’ll probably save you a fair bit.
It works well in conjunction with other tools. If you use a WYSIWYG editor, I suggest you run the code through HTML Tidy first. HTML Tidy, unlike htmlcrunch, actually interprets the HTML and removes some troublesome information. In some cases HTML Tidy will add characters, but this is usually a good thing: its changes improve browser compatibility. If you feed HTML Tidy a bunch of broken HTML, it’ll fix it for you.
You can further optimize your HTML with the help of a pair of Unix commands. But you run Windows? No sweat. You can grab native Windows command-line versions of a whole slew of Unix tools in one big Zip file here.
I’ve found that these HTML tools sometimes leave spaces between HTML elements under some circumstances. Whether this is intentional or a bug in the code, who knows. But it’s easy to fix with the Unix tr command:
tr "> indexopt.html
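If you don't have the Unix tools handy, the cleanup described above (dropping whitespace that sits only between one tag's closing `>` and the next tag's opening `<`) can be sketched in a few lines of Python. This is a rough illustration, not what htmlcrunch or tr does internally. The function name `squeeze_between_tags` is made up for the example. It makes no attempt to protect `<pre>` blocks, or the space between inline elements such as two adjacent links, where removing the space changes what the reader sees.

```python
import re

# Matches a '>' followed by nothing but whitespace and then a '<'.
# The lookahead leaves the '<' in place so consecutive gaps are all caught.
_GAP = re.compile(r">\s+(?=<)")

def squeeze_between_tags(html: str) -> str:
    """Remove whitespace that sits between two tags and nothing else."""
    return _GAP.sub(">", html)

if __name__ == "__main__":
    sample = "<ul>\n  <li>one</li>   <li>two words</li>\n</ul>"
    print(squeeze_between_tags(sample))
```

Whitespace inside text, such as the space in "two words", is left alone because it is not followed directly by a `<`.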
Some people believe that Web browsers parse 255-character lines faster than any other line length. I’ve never seen this demonstrated. And in my experience, any Web browser parses straight-up HTML plenty fast no matter what, unless you’re running a seriously, seriously underpowered machine, in which case optimizing the HTML isn’t going to make a whole lot of difference. Also in my experience, every browser I’ve looked at parses CSS entirely too slowly. It takes most browsers longer to render this page than it takes for my server to send it over my pokey DSL line. I’ve tried mashing my stylesheets down, and whether I used multiple 255-character lines or no linebreaks whatsoever made little, if any, difference.
But if you want to try it yourself, pass your now-optimized HTML file(s) through the standard Unix fmt command, like so:
fmt -w 255 index.html > index255.html
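If fmt isn't available, here is a rough Python stand-in for the same experiment. It is a sketch, not a reimplementation of fmt: it collapses all whitespace inside each blank-line-separated paragraph and re-wraps it to at most 255 columns using the standard `textwrap` module. The function name `refill` is made up for the example. Like the fmt approach, it knows nothing about HTML, so it will happily reflow `<pre>` blocks too.

```python
import re
import textwrap

def refill(text: str, width: int = 255) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` columns.

    Words longer than `width` are left unbroken, so such a line can run over.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [
        textwrap.fill(
            " ".join(p.split()),      # collapse newlines and indentation first
            width=width,
            break_long_words=False,   # never split a long URL or attribute
            break_on_hyphens=False,
        )
        for p in paragraphs
    ]
    return "\n\n".join(wrapped) + "\n"

if __name__ == "__main__":
    with open("index.html", encoding="utf-8") as src:
        result = refill(src.read())
    with open("index255.html", "w", encoding="utf-8") as dst:
        dst.write(result)
```

The file names simply mirror the ones in the fmt command above.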
Optimizing your HTML files to the extreme will take a little time, but it’s probably something you only have to do once, and your page visitors will thank you for it.
Another program to add to the Interesting DOS programs list 🙂
Trinidad and Tobago Computer Society at http://www.ttcsweb.org and http://www.ttcs.net
A small question, then a long diatribe. 🙂 Is htmlcrunch smart enough to keep internal spaces in empty XHTML tags? e.g. <br />
That space is highly suggested to keep “some” browsers from getting lost.
To quote Dennis Miller, I may be getting off on a rant here, but when I see programs of this sort, I have to ask “how much benefit am I gaining vs. the cost of the effort?” It seems like using htmlcrunch is a very low effort exercise, so I wouldn’t lobby against using it. But unless you’ve got some nasty whitespace problems, I doubt it’ll help you much. (HTML with volumes of commentary is a different story.) Let me ‘splain.
HTML is text. Text is highly compressible. With HTTP 1.1, a browser can ask for compressed content, and a server configured for it will very nicely compress your HTML for its journey across the wire. Whitespace compresses especially well.
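The claim is easy to test for yourself. The Python sketch below measures a page four ways: as-is, gzipped, whitespace-crunched, and crunched then gzipped. Everything in it is an assumption for illustration: `crunch` is a crude stand-in for a tool like htmlcrunch, `sample_page` builds a made-up, heavily indented page, and the comparison only matters if the server actually sends gzipped responses. A page this repetitive compresses unusually well, so try it on real pages before drawing conclusions.

```python
import gzip
import re

def crunch(html: str) -> str:
    """Crude stand-in for a whitespace cruncher: collapse each run of whitespace to one space."""
    return re.sub(r"\s+", " ", html).strip()

def sizes(html: str) -> dict:
    """Byte counts for a page before and after crunching, with and without gzip."""
    raw = html.encode("utf-8")
    small = crunch(html).encode("utf-8")
    return {
        "raw": len(raw),
        "raw_gzip": len(gzip.compress(raw)),
        "crunched": len(small),
        "crunched_gzip": len(gzip.compress(small)),
    }

def sample_page(items: int = 200) -> str:
    """Build a made-up, heavily indented page so there is plenty of whitespace to remove."""
    rows = "\n".join(
        f"            <li>\n                Item number {i}\n            </li>"
        for i in range(items)
    )
    return f"<html>\n    <body>\n        <ul>\n{rows}\n        </ul>\n    </body>\n</html>\n"

if __name__ == "__main__":
    for name, count in sizes(sample_page()).items():
        print(f"{name:>14}: {count} bytes")
```

To check one of your own pages, pass its contents to `sizes` and compare the gap between the two gzip figures with the gap between the two uncompressed ones.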
Unless you’ve got a vastly underpowered machine, raw parsing speed is a non-issue; it’s what the browser does in reaction to tokens that takes time. That’s a big distinction. I wouldn’t say that CSS *parses* slowly, but that browsers take some time setting up and applying styles.
The same applies to the server side of things – PHP, ASP, etc. Parsers zip right past (i.e. do not interpret) whitespace and comments because they’re not tokens. From peeking at the source, it looks like htmlcrunch is rather HTML grammar-specific, so I’m not sure what it’d do to a server script file. In any case, if you’re a freak of nature and verbosely comment your PHP :), I’d consider using the Apache mod that will pre-compile and cache your scripts in lieu of munging source code and keeping multiple copies around. Still, for moderately commented code, I doubt you’ll see much difference – we’re still talking about interpreted code here, just minus the tokenizing. (Disclaimer: I haven’t used this mod myself, but plan on trying it for grins soon.)
I think the core issue here is the perception of round-trip time. You make a page request, it goes over the wire, a server may execute a script and will hit memory and/or disk, data flows back over the wire, and your browser parses and renders it. The wire is orders of magnitude slower than memory and the primary bottleneck; disk access is relatively slow. But parsing the data isn’t, though *acting upon* the data – either in a browser or a server-side script – may be. Ultimately, having 500 extra bytes of (compressed) whitespace isn’t going to change a user’s perception of a three- to five-second round-trip time, IMHO.
Another Windows-based program to look at for optimizing HTML is Webtrimmer. Formerly shareware, now freeware, available at http://www.glostart.com/webtrimmer/webtrimmer.html.
Unlike HTML Tidy, it doesn’t reparse any code. It removes whitespace, lengthens lines, and replaces tokens like &nbsp; or &quot; with their shortest possible equivalent. Then it goes through and removes do-nothing HTML commands, like empty tag pairs, or redundant tags (like specifying the font for each and every paragraph) of the sort that MS Word and Frontpage love to put in.
The program has its drawbacks — it’s old and only knows about HTML 3.2, it doesn’t remove redundant code inside tables — but it can significantly reduce Frontpage bloat, for those who don’t have the time or ability to code pages by hand.
Anti-MS aside: A small online catalog I did for a client had its largest page at 30k when I was done hand coding it. Someone else uploaded the catalog, and that person stupidly ran the pages through MS Frontpage. The 30k page more than tripled in size, and the entire site gained hundreds of K of bloat. And in return for all that extra bandwidth, Frontpage took the aesthetic formatting I had very carefully tested on Netscape, Opera, and IE, and made it look like crap.
As to the question of how much reward such tweaking gives… there are still a lot of HTTP 1.0 browsers out there, and a lot of people using dialup (I’d still be using dialup myself if my partner’s company hadn’t offered to pay for our DSL so she can work from home – long story). I vividly remember watching the modem lights blink, drumming my fingers, checking that I had load images turned off, waiting for a simple HTML page to download. Anything that can reduce file sizes is a definite plus in the world of dialup.
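Some back-of-the-envelope arithmetic puts numbers on both sides of this argument. The Python sketch below computes idealized transfer times for the sizes mentioned in this thread. The link speeds are assumptions: 56,000 bits per second is the nominal rate of a 56k modem (real connections are usually slower), and the DSL figure is hypothetical. Latency, protocol overhead, and any compression are ignored.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Idealized time to push size_bytes through a link, ignoring latency and overhead."""
    return size_bytes * 8 / link_bits_per_second

DIALUP = 56_000   # nominal 56k modem rate; an assumption, real rates are lower
DSL = 768_000     # hypothetical DSL downstream rate, for comparison only

if __name__ == "__main__":
    cases = [
        ("500 bytes of whitespace", 500),
        ("30k hand-coded page", 30 * 1024),
        ("same page tripled in size", 90 * 1024),
    ]
    for label, size in cases:
        print(f"{label}: {transfer_seconds(size, DIALUP):.2f}s dialup, "
              f"{transfer_seconds(size, DSL):.2f}s DSL")
```

On those assumptions, 500 bytes costs well under a tenth of a second even on dialup, while tripling a 30k page adds nearly nine seconds.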
: : TO ALL THOSE OF KNOWLEDGE : :
I have been reading posts on this site and others regarding optimizing HTML and images.
Like many others, I have pages that programs like Frontpage have ballooned with redundant tags. I am also certain most of my images can be reduced without losing ANY quality.
Though I have knowledge of the problem, I do not have the expertise or confidence to accurately carry out the repair without causing new problems. I also do not know which areas are necessary to address and to what extent, e.g. blank spaces, browser compatibility, white space, etc.
So in short, I am looking for either (and preferred) a user friendly windows application to address these issues, optimizing my code and images, as well as advice on what should or should not be done to the code…. or, (though not good for long run self maintenance) find one of you to optimize the code and images by your own means.
Thanks for your time and help.. the assistance is MUCH MUCH appreciated!
Please contact me by email.. email@example.com
please visit our site
Check out monitorcentral (GPL’d).
-w3c problems (using TIDY)
-multi-page cleaning (also using tidy)
-section 508 problems
-web page size totals (including scripts and pics)
-find broken links