© 2005 texSoft.it



1 User-Agent definition

This document describes how to identify the HTTP user agent (browser or robot) by examining the User-Agent string in the HTTP header of the request sent to the web server.

The User-Agent string format is described by RFC 1945 and RFC 2068. Here are some excerpts from RFC 2068; please refer to the original RFCs for a complete reference:

Many HTTP/1.1 header field values consist of words separated by LWS
   or special characters. These special characters MUST be in a quoted
   string to be used within a parameter value.

          token          = 1*<any CHAR except CTLs or tspecials>

          tspecials      = "(" | ")" | "<" | ">" | "@"
                         | "," | ";" | ":" | "\" | <">
                         | "/" | "[" | "]" | "?" | "="
                         | "{" | "}" | SP | HT

   Comments can be included in some HTTP header fields by surrounding
   the comment text with parentheses. Comments are only allowed in
   fields containing "comment" as part of their field value definition.
   In all other fields, parentheses are considered part of the field
   value.

          comment        = "(" *( ctext | comment ) ")"
          ctext          = <any TEXT excluding "(" and ")">

3.8 Product Tokens

   Product tokens are used to allow communicating applications to
   identify themselves by software name and version. Most fields using
   product tokens also allow sub-products which form a significant part
   of the application to be listed, separated by whitespace. By
   convention, the products are listed in order of their significance
   for identifying the application.

          product         = token ["/" product-version]
          product-version = token

   Examples:

          User-Agent: CERN-LineMode/2.15 libwww/2.17b3
          Server: Apache/0.8.4

   Product tokens should be short and to the point -- use of them for
   advertising or other non-essential information is explicitly
   forbidden.  Although any token character may appear in a product-
   version, this token SHOULD only be used for a version identifier
   (i.e., successive versions of the same product SHOULD only differ in
   the product-version portion of the product value).

14.42 User-Agent

   The User-Agent request-header field contains information about the
   user agent originating the request. This is for statistical purposes,
   the tracing of protocol violations, and automated recognition of user
   agents for the sake of tailoring responses to avoid particular user
   agent limitations. User agents SHOULD include this field with
   requests. The field can contain multiple product tokens (section 3.8)
   and comments identifying the agent and any subproducts which form a
   significant part of the user agent. By convention, the product tokens
   are listed in order of their significance for identifying the
   application.

          User-Agent     = "User-Agent" ":" 1*( product | comment )

   Example:

          User-Agent: CERN-LineMode/2.15 libwww/2.17b3

Unfortunately, some browsers do not follow the standard, and each browser often composes the User-Agent string in its own format, resulting in a jungle of different formats. Moreover, many browsers identify themselves as another browser for compatibility reasons, which makes it harder for software to recognize the actual name and version of the agent.

The following links provide useful information on the User-Agent string formats of the various browsers and robots:
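To see how far real strings drift from the grammar quoted above, the following sketch checks a string against a strict reading of the rule User-Agent = 1*( product | comment ). The function name is_rfc_user_agent is hypothetical. As simplifying assumptions, elements must be separated by spaces or tabs, and comments may nest only one level deep, whereas the RFC allows any depth.

```php
<?php
// Hypothetical helper: true if $agent is a sequence of RFC 2068 product
// tokens and comments. Simplified: elements are separated by spaces or
// tabs, and comments may contain one level of nested parentheses.
function is_rfc_user_agent($agent)
{
  // token = one or more CHARs except CTLs and tspecials
  $token   = '[^\x00-\x20\x7F-\xFF()<>@,;:\\\\"\/\[\]?={}]+';
  // product = token ["/" product-version]
  $product = $token . '(?:\/' . $token . ')?';
  // comment = "(" *( ctext | comment ) ")"
  $comment = '\((?:[^()]|\([^()]*\))*\)';
  $element = '(?:' . $product . '|' . $comment . ')';
  $pattern = '/^' . $element . '(?:[ \t]+' . $element . ')*$/';
  return preg_match($pattern, $agent) === 1;
}
```

With this check the CERN-LineMode example from the RFC passes, while a string such as Mozilla/4.7 [en] (WinNT; U) fails, because [ and ] are tspecials and cannot appear in a token.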

User agent definition from Wikipedia, with many User-Agent string examples (http://en.wikipedia.org/wiki/User_agent)

Many details on User-Agent string formats, with history of the evolution of the various browsers' formats and many examples of real User-Agent strings for most popular browsers (http://www.hyperborea.org/journal/archives/2004/06/19/whats-in-a-user-agent-string/)

Very complete list of User-Agent strings (http://www.pgts.com.au/pgtsj/pgtsj0212d.html)


2 Why recognize the User-Agent?

In the early stages of the WWW, many browsers used incompatible HTML extensions (which led to a war of browsers and formats), and web designers often had to use in-page scripts to adapt the page to the features offered by the browser requesting it. Nowadays HTML has reached a solid standard to which most browsers adhere, and adapting a page's contents to the browser type is no longer a pressing need, but it is still useful, for example to handle differing implementations of JavaScript.

User-Agent recognition is also useful for statistical reasons, to measure in the field the market share of the different browsers. This document focuses mainly on recognizing the User-Agent for statistical purposes.


3 Parse the User-Agent string

Among the thousands of different strings, it is possible to recognize a common format that allows parsing them. Identifying the real name and version of the agent from the stream of tokens produced by the parsing process is less simple.

Here are some examples of quite different strings:

Mozilla/4.7 [en] (WinNT; U)
Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0) Opera 5.11 [en]
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040612 Firefox/0.8
Mozilla/5.0 (compatible; Konqueror/3.2; Linux) (KHTML, like Gecko)
Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6h

Each string is built from a sequence of product/version pairs and comment elements, even if in slightly different formats. The following POSIX regular expression can parse each block:

^([^/[:space:]]*)(/([^[:space:]]*))?([[:space:]]*\[[a-zA-Z][a-zA-Z]\])?
[[:space:]]*(\((([^()]|(\([^()]*\)))*)\))?[[:space:]]*

Note that this regular expression, shown here split over two lines, is less restrictive than the rules defined by the RFCs, in order to better cope with non-standard strings. The pattern tries to match a product["/" product-version] followed by a comment. Both are optional, so the expression can match only a product[/version], only a comment, or both. Here is the explanation of each fragment of the regular expression:

^([^/[:space:]]*)

Product token (the user agent's name): any sequence of characters except / and white space. It can be an empty string, to handle a comment that stands alone, as in:

Mozilla/5.0 (compatible; Konqueror/3.2; Linux) (KHTML, like Gecko)

The second parenthesized block is matched as a comment with no product. This relies on POSIX matching returning the longest possible match; with a Perl-style engine the product token would also have to exclude ( to get the same result.

(/([^[:space:]]*))?

Optional version of the user agent, following the / after the product token.

([[:space:]]*\[[a-zA-Z][a-zA-Z]\])?

Some old Netscape versions put a language code within brackets here; it is skipped as non-standard. For example:

Mozilla/4.7 [en] (WinNT; U)

[[:space:]]*

Eats the spaces between the agent's name/version and the comment.

(\((([^()]|(\([^()]*\)))*)\))?

Optional comment within parentheses. One level of nested parentheses is allowed inside the comment, as in:

Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.11)

If Perl-compatible regular expressions (PCRE) are used, the following pattern, written with the x modifier so that white space is ignored, recursively matches arbitrarily deep levels of ( ):

\( ( ( (?>[^()]+) | (?R) )* ) \)

[[:space:]]*

Eats trailing spaces.
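The recursive variant mentioned above can be tried with PHP's preg_match, whose PCRE engine supports recursion. This is a minimal sketch: the function name extract_first_comment is hypothetical, and (?1) is used in place of (?R) so that the recursion re-enters only the parenthesized group, which lets the fragment be embedded in a larger pattern.

```php
<?php
// Hypothetical helper: returns the first balanced ( ... ) comment found
// in $agent, without the outer parentheses, or null if there is none.
function extract_first_comment($agent)
{
  // Group 1 matches one balanced pair of parentheses; (?1) re-enters
  // group 1 to match nested pairs at any depth. The x modifier allows
  // the white space used to lay out the pattern.
  $pattern = '/ ( \( (?: [^()]++ | (?1) )* \) ) /x';
  if (preg_match($pattern, $agent, $m) === 1)
  {
    return substr($m[1], 1, -1);   // strip the outer ( and )
  }
  return null;
}
```

Unlike the POSIX fragment, this accepts comments nested two or more levels deep, such as (X11; U; Linux i686 (x86_64 (64 bit)); en-US).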


The regular expression is applied to the User-Agent string repeatedly, until the end of the string is reached; each pass extracts one product/comment block from the string. For example, if it is applied to

Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

it matches these:

Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.10)
Gecko/20050716
Firefox/1.0.6

Here is an excerpt of PHP code that implements the parsing described above:

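function extract_products_from_agent_string($agent)
{
  $found = array();
  // PCRE version of the pattern above. PCRE does not return the longest
  // match as POSIX does, so the product token also excludes "(" to let
  // a comment with no product match.
  $pattern  = "~^([^/([:space:]]*)" . "(/([^[:space:]]*))?";
  $pattern .= "([[:space:]]*\[[a-zA-Z][a-zA-Z]\])?" . "[[:space:]]*";
  $pattern .= "(\\((([^()]|(\\([^()]*\\)))*)\\))?" . "[[:space:]]*~";

  while (strlen($agent) > 0)
  {
    // ereg() was removed in PHP 7; preg_match() fills $a with the groups
    if (preg_match($pattern, $agent, $a) && strlen($a[0]) > 0)
    {
      // trailing groups that did not take part in the match are not set
      array_push($found, array("product" => $a[1],
        "version" => isset($a[3]) ? $a[3] : "",
        "comment" => isset($a[6]) ? $a[6] : ""));
      $agent = substr($agent, strlen($a[0]));
    }
    else $agent = "";		// abort parsing, no match
  }

  return $found;
}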

The function parses the User-Agent string passed as its argument and returns an array of associative arrays, each one describing a product/comment block of the agent string (the keys of each associative array are product, version and comment).


4 Identify the product

Once the User-Agent string is parsed and we have the list of products/comments, more problems arise.

Very few agents define themselves with something like Real_product_name/Version; most, for historical compatibility reasons, begin their string by declaring themselves as “Mozilla/x.y”.

Some well-behaved agents declare their real identity as the second (or third) product in the agent string, so it is enough to check for them. Others place the real product name in the comment of the first product, as Internet Explorer and Konqueror do, for example:

Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Mozilla/4.0 (compatible; MSIE 6.0; MSN 2.5; Windows 98)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
Mozilla/5.0 (compatible; Konqueror/3.1; Linux) 
Mozilla/5.0 (compatible; Konqueror/3.2; Linux) (KHTML, like Gecko)

As a general rule, if the first product is Mozilla/x.y and the first element of the ;-separated list in the comment is compatible, then the real name and version of the product are in the second element of the list. Some browsers cloak themselves even deeper; look at this subtle string:

Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0) Opera 5.11 [en]

This version of Opera declares itself to be Mozilla, then Explorer, and only at the end itself (using a non-standard way to declare the version)! And how can Mozilla itself be recognized, if everybody claims to be it? Not to mention that even Netscape (which in early versions was internally codenamed Mozilla) declares itself as Mozilla, and only in the latest versions adds its real name as the last entry in the list of products...

The following algorithm can be used to successfully identify most user agents. Tested with the lists of User-Agent strings from the first two links given at the beginning of this document, it correctly identified all the user agents and their versions.


products = extract_products_from_agent_string(agent_string);

/* if a product in the list matches one of those that correctly declare themselves, returns it */
if (is_in_list_one_of(products, FIREFOX, NETSCAPE, SAFARI, CAMINO, MOSAIC, OPERA, GALEON))
{
	product = the_product_matching;
	version = version_component_of_product_token_if_any;

	/* if Opera uses its non-standard format to declare the version, matches it */
	if (product == OPERA) verify_if_string_match_pattern_like("Opera 5.11 [en]");
}

/* handles browsers declaring 'Mozilla compatible' */
else if (first_product_in_list == MOZILLA and comment_begin_with("compatible;"))
{
	check_for_cloaked_products(AVANT_BROWSER, CRAZY_BROWSER);
	product = second_entry_in_comment;
	version = second_entry_in_comment_after_space_or_/;
}

/* handles the real Mozilla (or old Netscape if version < 5.0) */
else if ( first_product_in_list == MOZILLA)
{
	if (product_version < 5.0)
	{
		product = NETSCAPE;
		version = version_component_of_product_token_if_any;
	}
	else
	{
		product = MOZILLA;
		version = get_mozilla_version_from_comment;
	}
}

/* if none of the above matches, uses first product token in list */
else
{
	product = first_product_in_list;
	version = version_component_of_product_token_if_any;
}
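The pseudocode above could be sketched in PHP roughly as follows. The function name identify_product is hypothetical; $products is assumed to have the format returned by extract_products_from_agent_string. Two simplifications are made: the check for cloaked products (Avant Browser, Crazy Browser) is left out, and the non-standard Opera version is assumed to arrive as the product token that follows Opera, which is how the parser splits “Opera 5.11 [en]”.

```php
<?php
// Hypothetical sketch of the algorithm above. $products is a list of
// arrays with the keys product, version and comment. Returns
// array(name, version). Cloaked products are not handled here.
function identify_product($products)
{
  if (count($products) == 0) return array("", "");
  $known = array("firefox", "netscape", "safari", "camino",
                 "mosaic", "opera", "galeon");

  // 1. products that correctly declare themselves
  foreach ($products as $i => $p)
  {
    $name = strtolower($p["product"]);
    if (in_array($name, $known, true))
    {
      $version = $p["version"];
      // "Opera 5.11 [en]": the version is parsed as the next product token
      if ($name == "opera" && $version == "" && isset($products[$i + 1]))
        $version = $products[$i + 1]["product"];
      return array($p["product"], $version);
    }
  }

  $first = $products[0];
  $entries = array_map('trim', explode(";", $first["comment"]));

  if (strtolower($first["product"]) == "mozilla")
  {
    // 2. "Mozilla/x.y (compatible; Name Version; ...)"
    if (strtolower($entries[0]) == "compatible" && isset($entries[1]))
    {
      // name and version are separated by a space or a slash
      $parts = preg_split('~[ /]~', $entries[1], 2);
      return array($parts[0], isset($parts[1]) ? $parts[1] : "");
    }
    // 3. old Netscape (version < 5.0) or the real Mozilla
    if ((float)$first["version"] < 5.0)
      return array("Netscape", $first["version"]);
    foreach ($entries as $e)
      if (strpos($e, "rv:") === 0)
        return array("Mozilla", substr($e, 3));
    return array("Mozilla", "");
  }

  // 4. none of the above: use the first product token
  return array($first["product"], $first["version"]);
}
```

The known-product check is case-insensitive on purpose, since some agents vary the capitalization of their name.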


5 Identify the version

Well-behaved user agents report the version in their product token, following the RFC definition product = token ["/" product-version]. Those that declare themselves as Mozilla/compatible usually report their version after the product name, in the second entry of the ;-separated list in the comment.

In Mozilla the version is reported in the comment string, as the last entry in the list, beginning with “rv”. For more details on Mozilla's comment format, see http://www.mozilla.org/build/revised-user-agent-strings.html.

Others use a non-standard way to report the version, like “Opera 5.11 [en]”. Sometimes, however, the version is omitted completely. The algorithm above can correctly identify the version of the most common user agents.

Here are some examples, each preceded by the version it reports:

5.5         Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
3.1         Mozilla/5.0 (compatible; Konqueror/3.1; Linux 2.4.22-10mdk; X11; i686; fr, fr_FR)
1.7.8       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511
1.0.4       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
7.0         Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.0.1) Gecko/20020920 Netscape/7.0
6.03        Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.03 [en]
2.8.4rel.1  Lynx/2.8.4rel.1 libwww-FM/2.14
2.1         Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)


6 Identify the Operating System

The place to look for the operating system of the user agent is the first comment in the User-Agent string. Mozilla defines a well-known format (see the link above), in which the name of the operating system is the third element of the ;-separated list. For compatible agents the operating system is also the third element, with some exceptions:

Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90; MSN 6.0)

Here “Win 9x 4.90” stands for Windows Me... but Windows 98 is declared as well! As there is no guarantee about the position of the real operating system, the following algorithm can be used to avoid problems:

os_list = empty;
for each element in first_comment separated by ";"
{
	if (element begins_with("win") or
		element contains("linux") or
		element contains("Macintosh", "Mac OS X") or
		element contains("FreeBSD") or
		element contains("NetBSD") or
		element contains("OpenBSD") or
		element contains("SunOS") or
		element contains("Amiga") or
		element contains("BeOS") or
		element contains("IRIX") or
		element contains("OS/2", "Warp"))
	{
		add element to os_list;
	}
}

if (count(os_list) > 1)
	/* for Windows, excludes the generic "windows" entry; if "Win 9x 4.90" is present, returns it */
	os = get_relevant_element(os_list);
else
	os = os_list[0];
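A possible PHP rendering of this algorithm is sketched below. The function name identify_os is hypothetical, and the selection among several candidates is only one reading of get_relevant_element: “Win 9x 4.90” is preferred when present, otherwise the first entry that is not the bare word Windows is returned.

```php
<?php
// Hypothetical sketch: returns the operating system entry found in the
// first comment of a User-Agent string, or "" if none is recognized.
function identify_os($first_comment)
{
  $names = array("linux", "macintosh", "mac os x", "freebsd", "netbsd",
                 "openbsd", "sunos", "amiga", "beos", "irix", "os/2", "warp");
  $os_list = array();

  foreach (explode(";", $first_comment) as $element)
  {
    $element = trim($element);
    $lower = strtolower($element);
    $is_os = (strpos($lower, "win") === 0);      // begins with "win"
    foreach ($names as $name)
      if (strpos($lower, $name) !== false) $is_os = true;
    if ($is_os) $os_list[] = $element;
  }

  if (count($os_list) == 0) return "";
  // "Win 9x 4.90" (Windows Me) takes precedence over "Windows 98"
  foreach ($os_list as $os)
    if (strcasecmp($os, "Win 9x 4.90") == 0) return $os;
  // otherwise skip the generic "Windows" entry when a better one exists
  foreach ($os_list as $os)
    if (strcasecmp($os, "Windows") != 0) return $os;
  return $os_list[0];
}
```

The comparison is case-insensitive, so an entry such as Linux i686 is matched by the lowercase name linux.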


7 Brute force approach

If nothing else works, or to verify that the code identifies user agents correctly, it can be useful to keep a database of known User-Agent strings, periodically updated as new agents visit. The database stores the full string together with the real product, version and operating system. To identify the agent, or to check the result of the recognition code, just search for the string in the database.

This approach is easier, as it does not need any complex code, but it consumes more resources: it requires a database, storage for the table's data, and processor time to execute the query.
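As a minimal illustration of the lookup, the sketch below uses a PHP array as a stand-in for the database table; a real installation would query a database instead. The names $known_agents and lookup_agent are hypothetical, and the single record is built from the Opera example string used earlier in this document.

```php
<?php
// Hypothetical in-memory stand-in for the database table: the key is the
// full User-Agent string, the value holds the known product, version and
// operating system.
$known_agents = array(
  "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0) Opera 5.11 [en]" =>
    array("product" => "Opera", "version" => "5.11", "os" => "Windows NT 4.0"),
);

// Returns the stored record for $agent, or null if the string is unknown
// and has to be identified by the recognition code instead.
function lookup_agent($known_agents, $agent)
{
  return isset($known_agents[$agent]) ? $known_agents[$agent] : null;
}
```

A null result is the point where the recognition code would take over and, if desired, where the new string would be queued for addition to the table.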

A very complete list of known User-Agent strings, with product name and version, can be found on the PGTS site (http://www.pgts.com.au/) here: http://www.pgts.com.au/download/data/browser_list.txt


8 PHP script

The rules described in this article have been used to develop a PHP script that can identify most User-Agent strings. Tested with the huge list (more than 13,000 entries) of User-Agent strings found at http://www.pgts.com.au/download/data/browser_list.txt, the script correctly identified most agents, reporting the correct product, version and operating system; only badly malformed strings could not be matched.