Manas Tungare

URL Design Sins: 16 things that don't belong in URLs

(Because 16 is as good a number as any.)

Much has been said for a long time about making your URLs easy to use, remember, type, hack, and spread virally. There is still no dearth of ugly URLs all over the Web. A few very popular content management systems also engage in dirty URL practices, and it's a shame. To aid you in cleaning up your URLs, here's a list of specific things that do not belong in a URL.

  1. www. We've spent enough time with the World Wide Web to know that web pages reside on the WWW. Adding those four characters to the beginning of every single URL not only requires users to type them in every time, but also requires 4 extra bytes in every single database that stores URLs. Think for a moment how many bytes that would be. Get rid of them! And after you do that, make sure all your www. URLs redirect to the non-www. version.
  2. Port numbers. Unless your site is under test, there is no valid reason for hosting it on a non-default port (i.e., a port other than 80 for HTTP or 443 for HTTPS). Apache on Mac OS X has a performance cache that runs on port 16080, and makes every URL of the form https://your-site.com:16080/. Unless you find a mechanism to run the performance cache on the default port, it is a good idea to dump the cache. It's not worth the confusing URL (to most users, if not to you). Standard well-known port numbers are there for a reason.
  3. Index filenames. Filenames such as index.php and default.asp do not give us any more information than the rest of the URL. Drop them.
  4. Details of the server-side technology. Your users don't need to know what software you're running behind the scenes. They couldn't care less about whether your pages are .php, .jsp, .aspx or .do. It's best to configure your server to hide these extensions, and then make sure none of your URLs contain them.
  5. Special directories for special scripts. You no longer need to place your scripts in a cgi-bin. Get rid of that directory and any others like that. If your server requires you to do something like that, either find a way to configure it correctly, or upgrade to one that will let you do that.
  6. Document maintainers' names. Often, when each document has an assigned maintainer for some duration of time, those documents end up being in that particular person's web space. Later, when the maintainer moves on or someone else takes over the maintenance, you're left with a different URL than what you started with. To avoid this, it's best to categorize documents by topic and subject instead of under ~username/document.html.
  7. Internal database IDs. Sure, your content management system needs those IDs to locate your content, but your users don't need to know. If it takes an extra database lookup to get the ID from the URL, then so be it.
  8. CMS module names. Use a CMS that is intelligent enough to render a page without needing all sorts of information stored in the URL. Joomla is particularly notorious for this. What does this URL tell you about where it will take you?

    https://www.joomla.org/content/section/1/74/

    Now what if it were:

    https://joomla.org/news

  9. MiXeD-CaSe NaMeS. Don't confuse your users by-Mixing-Upper-case-and-Lower-case-Characters-in-the-URL. Stick to lower-case letters, and don't make them guess. If a user does type in a URL in mixed case, normalize it on the server and redirect to the lower-case version.
  10. Random gunk. Unless you are a URL-compressor service such as Tiny URL or SnipURL, forget using random characters in your URL. Nobody wants to visit https://yourdomain.com/WijHyYQnVPWNs and guess what it might lead to.
  11. Session IDs. Make sure no user-session-specific identifiers end up in your URLs. That way, users can pass URLs on to other users via email or IM, bookmark them, and be sure that each URL represents a single resource. There are better places to keep session state, such as cookies.
  12. Punctuation. Avoid punctuation that might make it difficult for people to tell others about your wonderful site over the phone. The only punctuation you may have is a hyphen ("-") and characters that have special meaning in a URL (e.g. ?, #, :, + and @). No underscores, commas, periods, brackets, parentheses, braces, quotes, less-than, greater-than, equals, or pipes.
  13. Database query details. If your web pages have even a hint of database query language in the URLs, you should be on The Daily WTF.
  14. Repeated domain name. If the address of your web site looks like https://your-site.com/your-site/your-page.html, then you should have a chat with your web hosting provider about how to shorten it to https://your-site.com/your-page.html.
  15. Inconsistent naming. If you sell several products, then make the subdirectories below each product name exactly identical. If someone were to replace a product name by another, the rest of the URL structure should still continue to function. In other words, strive for consistency in naming.
  16. Missing content at each level. When a URL is several levels deep, users should be able to chop off parts at the end ("hack the URL") and still be able to get to a usable page. E.g. if you're a news site, and if an example URL looks like: https://my-news-site.com/2008/05/21/news-story.html, make sure you include a list of news articles from 21 May 2008 at https://my-news-site.com/2008/05/21/, and a list of links to daily articles for the entire month of May 2008 at https://my-news-site.com/2008/05/.
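
To check the last point on your own site, it helps to list every URL a user could reach by chopping segments off the end. Here is a minimal sketch in Python, using only the standard library. The function name parent_urls is made up for this example, and it assumes every level is written with a trailing slash, as in the news-site example above.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url: str) -> list[str]:
    """List the URLs obtained by chopping path segments off the end of `url`.

    Hypothetical helper for illustration: each returned URL is one that
    should serve a usable page if the site supports "hacking the URL".
    """
    parts = urlsplit(url)
    # Ignore empty segments produced by leading or trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing segment at a time, finishing at the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "".join(seg + "/" for seg in segments[:depth])
        # Query string and fragment are deliberately discarded.
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


if __name__ == "__main__":
    for u in parent_urls("https://my-news-site.com/2008/05/21/news-story.html"):
        print(u)
```

Feed each result to whatever link checker you already use; any level that returns a 404 is a gap worth filling.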

There are some easy technological solutions to make this work. Many of these do not require you to change the underlying file system structure or database structure.
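
As a rough illustration, several of the sins above (www., default port numbers, index filenames, and mixed case) can be cleaned up by computing one canonical URL and redirecting everything else to it. The sketch below is in Python with only the standard library. The name canonical_url, the DEFAULT_PORTS table, and the INDEX_FILENAMES set are assumptions made for this example, not part of any particular server or CMS, and the function only works out the redirect target; issuing the redirect is left to your server.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical settings for this sketch; adjust them for your own site.
DEFAULT_PORTS = {"http": 80, "https": 443}
INDEX_FILENAMES = {"index.php", "index.html", "default.asp"}


def canonical_url(url: str) -> str:
    """Return the clean form of `url` that a server could redirect to."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # `hostname` is lower-cased by urlsplit and excludes port and userinfo.
    host = parts.hostname or ""

    # Sin 1: drop a leading "www."
    if host.startswith("www."):
        host = host[4:]

    # Sin 2: keep the port only when it is not the scheme's default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Sin 9: lower-case the path (the query string is left untouched).
    path = parts.path.lower()

    # Sin 3: drop a trailing index filename, keeping the directory slash.
    head, _, last = path.rpartition("/")
    if last in INDEX_FILENAMES:
        path = head + "/"

    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(canonical_url("https://www.Your-Site.com:443/Products/Index.php"))
```

If the result differs from the URL that was requested, respond with a permanent redirect to the result; if it is the same, serve the page as usual.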

But most of this comes down to discipline: there is no technology magic here. It is just an application of common sense to a common domain (no pun intended). Search for mod_rewrite and content negotiation to get started.