Last time we read about this stuff was around 2002, and HTML was dead. HTML 4.01 was the end - the future was the stricter world of XHTML with XML - the Holy Grail of markup languages - and we had all better get used to it. So while the majority of our pages remained HTML 4.01, we started limbering up with a few XHTML 1.0 Transitional pages using this definition:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
At the same time we started to use lower-case tags everywhere and got into the habit of closing empty tags with /> kluges like <br /> and <img />, even in HTML 4.01 pages. We were trying to adopt the good habits defined by the W3C XHTML documentation, slowly and incrementally. Whether we liked it or not. All pain for no demonstrable gain. But strict, man, really strict rules.
Now (mid 2007) it seems that HTML is alive and well, with HTML5 being worked on (somewhat contentiously) by the W3C as an incremental approach - though it will perhaps not reach standard status until 2010 or later. HTML5 is apparently trying to reconcile HTML with XHTML as well as to provide a pragmatic evolution. There are some proposed new tags and features which look kinda interesting, but the key point is that instead of threatening complete page failure if you get something wrong (which always scared the devil out of us), HTML5 focuses on a standard way of rendering bad pages - taming the tag soup rather than punishing it with a blank page.
So we are dropping all our XHTML DOCTYPEs and reverting to HTML 4.01 Transitional. Without any guilt. Instead perhaps we'll focus on trying to get rid of those align, target, width and border attributes and fix the Google advertising markup that prevents our pages from validating as strict HTML 4.01. Or perhaps we'll do nothing until the W3C publishes the draft paper defining the changes from HTML4 to HTML5, which it has so far declined to do. No way we're going down a blind alley again.
And if HTML5 goes off in some bizarre direction, like banning all /> closures (as this validator does), we'll just ignore it like we did (mostly) with XHTML. There is too much HTML out there, and sadly too many pages without any DOCTYPE at all (according to this research only 40% of pages have one), and only 2.6% of all pages are valid (X)HTML.
And in the meantime, due to the lack of standards development in the HTML world, we have emerging movements such as Microformats. Not sure if these are aberrant or useful.
Seems a simple enough task. Click the link on the right to the W3C's DOCTYPE recommendations. Scan the list, copy-and-paste into your web page and that's that.
So you're being good and you use an XHTML DOCTYPE, and you get warnings about character sets from the W3C's page validator. But you get the green bar, so you ignore them. Then you hit a page that uses strange characters and this time the character set message really hurts - no green bar. So you delete the first <?xml line before the DOCTYPE, the problem goes away and you get the green bar again. But wait a minute - why was it there in the first place, are there consequences if it's deleted, and if so what are they? Seems like you need to know stuff, not just blindly cut and paste. And then you stumble across this page and it gets more confusing - because the recommendations are very different. Just what's a working stiff supposed to do?
Here are the four main (non-frameset) DOCTYPES and our view of them:
# for masochists or those using CMS systems
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

# most of us can get away with this one as long as you are
# into /> kluges for br, hr, img, meta etc.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

# apart from align, border, width, target attributes and Google AdSense
# code this one is achievable
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">

# ritual suicide may be best if you can't get here
# apart from the occasional page bug that is
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
First, the DOCTYPE is only one part of the puzzle; the meta Content-Type tag within the <head></head> section also plays a significant role. An example of the tag is shown below.
<meta http-equiv="Content-Type" content="text/html">

The content value determines how the browser treats the page:

text/html
Irrespective of the DOCTYPE these pages use the normal tag soup HTML parser, with a few extra rules thrown in to make life thoroughly miserable. Even if your DOCTYPE is XHTML - no XML treatment.

application/xhtml+xml
Or application/xml. With an XHTML DOCTYPE you get XML treatment.
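Putting the two pieces together, a minimal HTML 4.01 page head might look like this (a sketch only - the title and body are obviously placeholders):

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
  <!-- charset declared in the meta tag; an HTTP header can override it -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Minimal page</title>
</head>
<body>
  <p>Hello.</p>
</body>
</html>
```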
Every time we see the word charset we get a headache. Our pages historically all have this charset:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
Charset=windows-1252 defines Microsoft's CP 1252 character set and is widely supported. It differs from ISO 8859-1 (and -15) by defining an additional 27 printable characters. But the <?xml declaration uses utf-8, hence the W3C's (mostly) gentle warning when it sees both. UTF-8 is an encoding scheme and implies the Unicode character set. So the fix is to either delete the <?xml line, change the charset to read:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Or delete the charset value (since it is already defined in the <?xml line in this case).
Note: If a browser receives a document without any charset value (and no <?xml declaration) it may default to CP-1252 (MSIE) or to ISO 8859-1, which is the default HTTP protocol character set; it may also guess the character set based on any meta http-equiv="Content-Language" value present, or attempt to sniff the document. It is an extremely good idea to have either a charset or a language, and preferably both, in all pages. Charsets supported by Microsoft's IE. Firefox's default charset is ISO-8859-1; to get a full list of supported values use Menu->Tools->Options, select the Content tab, Fonts & Colors, click the Advanced button - the Character Encoding combo list defines the supported values. Opera 9 works in Unicode (UTF-16) by default but attempts to sniff the page character encoding if none is present, and supports multiple legacy encodings which include windows-1252, ISO 8859-1 etc.
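The charset can also be set in the HTTP Content-Type header itself, which takes precedence over any meta tag in the page. A minimal sketch, assuming an Apache web server:

```apache
# httpd.conf or .htaccess - appends charset=UTF-8 to the HTTP
# Content-Type header of all text/html and text/plain responses
AddDefaultCharset UTF-8
```

If the header and the meta tag disagree, the header wins - worth checking before blaming the page.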
There is a lot of opinion-based usability and accessibility information out there, which we tend to avoid like the plague. There are links on the right of this page to research-based material. But what about good old-fashioned, predominantly text sites, like ours, and how to make them more accessible? Here is what we have done:
Page Structure: We lay our pages out so that the central (content) pane loads before the side menus. This should allow screen readers to get to the meat of the content quicker. Though we still have concerns that the banner section with a ton of drop-down menu HTML loads first.
Aural Stylesheet: We are considering adding an aural style sheet, though the W3C's accessibility page does not even mention this; we would first need to fix the menu problem (see previous point), which is pretty structural. If there had been some decent advice when we built the pages we could have avoided the problem. Finally, we note that there is no button or similar visual clue to indicate that an aural style sheet exists. We also see very little difference between a print style sheet and an aural one, since both should remove non-essential clutter.
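For the record, an aural style sheet sketch using the CSS 2 aural properties - the class names here are hypothetical (not from our actual markup), and browser support was always patchy:

```css
/* CSS 2 aural media: trim clutter from speech output */
@media aural {
  .banner, .menu { speak: none; }       /* skip the drop-down menu soup */
  h1, h2         { pause-before: 1s; }  /* short pause before headings */
  pre, code      { speech-rate: slow; } /* slow down for code samples */
}
```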
Character Sizes: Graphic designers want to control the layout of their pages for aesthetic reasons, which means using pt or px sizes for font rendering - and that can work against the user's browser settings in many cases. We have recently moved to using em sizes in the central pane - the content - while continuing to use pt sizes in the left- and right-hand menus due to space constraints. The theory is that em is defined relative to the user's selected point size, rather than being an absolute value.
We have changed our line spacing (leading) in the central content pane from the browser's default (roughly 1.1 times the point size) to 1.3em, to increase readability by adding more whitespace.
We have increased our central column margins from 10px to 20px to improve readability and reduce the amount of text per line.
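Taken together, the changes above look something like this (a sketch only - the selector names and the menu point size are hypothetical, not lifted from our actual style sheet):

```css
/* central content pane: relative sizes, so the user's browser
   settings stay in control */
#content {
  font-size: 1em;      /* tracks the user's chosen size */
  line-height: 1.3em;  /* up from the ~1.1 browser default */
  margin-left: 20px;   /* was 10px - shorter, more readable lines */
  margin-right: 20px;
}

/* side menus: absolute pt sizes, forced by space constraints */
#menu-left, #menu-right { font-size: 9pt; }
```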
There is less of this than one might expect. Here are the general conclusions that we found and have - generally - adopted:
Readers prefer black text on a white background and, in general, any dark-on-light combination is preferred to inverse text (light text on a dark background - for example, white text on black).
Sans-serif fonts for screen reading. Serif for printing.
12 point Arial is preferred to 12pt Verdana but 9/10pt Verdana is preferred to 9/10pt Arial.
White space is good. And more is better.
Yeah, that's all we found.
Problems, comments, suggestions, corrections (including broken links) or something to add? Please take the time from a busy life to 'mail us' (at top of screen), the webmaster (below) or info-support at zytrax. You will have a warm inner glow for the rest of the day.
If you are happy with it, that's OK - but your browser is giving you a less than optimal experience on our site. You could, at no charge, upgrade to a W3C standards-compliant browser such as Firefox.