1 User-Agent definition
This document describes how to identify the HTTP user agent (browser
or robot) by examining the User-Agent string in the HTTP header of the
request sent to the web server.
The User-Agent string format is defined by RFC 1945 and RFC 2068.
Here are some excerpts from RFC 2068; please refer to the original
RFCs for a complete reference:
Many HTTP/1.1 header field values consist of words separated by LWS
or special characters. These special characters MUST be in a quoted
string to be used within a parameter value.
token = 1*<any CHAR except CTLs or tspecials>
tspecials = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
Comments can be included in some HTTP header fields by surrounding
the comment text with parentheses. Comments are only allowed in
fields containing "comment" as part of their field value definition.
In all other fields, parentheses are considered part of the field
value.
comment = "(" *( ctext | comment ) ")"
ctext = <any TEXT excluding "(" and ")">
3.8 Product Tokens
Product tokens are used to allow communicating applications to
identify themselves by software name and version. Most fields using
product tokens also allow sub-products which form a significant part
of the application to be listed, separated by whitespace. By
convention, the products are listed in order of their significance
for identifying the application.
product = token ["/" product-version]
product-version = token
Examples:
User-Agent: CERN-LineMode/2.15 libwww/2.17b3
Server: Apache/0.8.4
Product tokens should be short and to the point -- use of them for
advertising or other non-essential information is explicitly
forbidden. Although any token character may appear in a product-
version, this token SHOULD only be used for a version identifier
(i.e., successive versions of the same product SHOULD only differ in
the product-version portion of the product value).
14.42 User-Agent
The User-Agent request-header field contains information about the
user agent originating the request. This is for statistical purposes,
the tracing of protocol violations, and automated recognition of user
agents for the sake of tailoring responses to avoid particular user
agent limitations. User agents SHOULD include this field with
requests. The field can contain multiple product tokens (section 3.8)
and comments identifying the agent and any subproducts which form a
significant part of the user agent. By convention, the product tokens
are listed in order of their significance for identifying the
application.
User-Agent = "User-Agent" ":" 1*( product | comment )
Example:
User-Agent: CERN-LineMode/2.15 libwww/2.17b3
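As a rough illustration of the grammar quoted above, the following PHP sketch checks whether a string is a valid token and whether it is a valid product. The function names is_rfc_token and is_rfc_product are invented for this example and do not come from the RFCs or from any library; the sketch assumes plain ASCII input.

```php
<?php
// Hypothetical helpers illustrating the RFC 2068 grammar; names are assumptions.

// token = 1*<any CHAR except CTLs or tspecials>
function is_rfc_token(string $s): bool
{
    if ($s === '') {
        return false; // "1*" means at least one character
    }
    // tspecials, including SP and HT
    $tspecials = "()<>@,;:\\\"/[]?={} \t";
    for ($i = 0; $i < strlen($s); $i++) {
        $c = ord($s[$i]);
        // CTLs are 0-31 and 127; anything above 127 is not a CHAR
        if ($c <= 31 || $c >= 127 || strpos($tspecials, $s[$i]) !== false) {
            return false;
        }
    }
    return true;
}

// product = token ["/" product-version], product-version = token
function is_rfc_product(string $s): bool
{
    $parts = explode('/', $s);
    if (count($parts) > 2) {
        return false;
    }
    foreach ($parts as $part) {
        if (!is_rfc_token($part)) {
            return false;
        }
    }
    return true;
}
```

With these definitions, CERN-LineMode/2.15 and libwww/2.17b3 are accepted as products, while a string such as Opera 5.11 is rejected because it contains a space.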
Unfortunately, some browsers do not follow the standard, and often
each browser composes the User-Agent string in its own format,
resulting in a jungle of different formats. Moreover, many browsers
identify themselves as another browser for compatibility reasons,
which makes it harder for software to recognize the actual name and
version of the agent.
The following links provide useful information on the User-Agent
string formats of the various browsers and robots:
User agent definition from Wikipedia, with many
User-Agent string examples (http://en.wikipedia.org/wiki/User_agent)
Many details on User-Agent string formats, with
history of the evolution of the various browsers' formats and many
examples of real User-Agent strings for most popular browsers
(http://www.hyperborea.org/journal/archives/2004/06/19/whats-in-a-user-agent-string/)
Very complete list of User-Agent strings
(http://www.pgts.com.au/pgtsj/pgtsj0212d.html)
2 Why recognize the User-Agent?
In the early stages of the WWW, many browsers used incompatible HTML
extensions (which led to a war of browsers and formats), and web
designers often had to use in-page scripts to adapt the page to the
features offered by the requesting browser. Nowadays HTML has reached
a solid standard to which most browsers adhere, and the need to adapt
a page's contents to the browser type is no longer pressing, but it is
still useful, for example to cope with differing implementations of
JavaScript.
User-Agent recognition is also useful for statistical reasons, to
measure in the field the market share of the different browsers. This
document focuses mainly on recognizing the User-Agent for statistical
purposes.
3 Parse the User-Agent string
Among the thousands of different strings, it is possible to recognize
a common format for parsing them. What is less simple is identifying
the real name and version of the agent from the stream of tokens
produced by the parsing process.
Here are some examples of quite different strings:
Mozilla/4.7 [en] (WinNT; U)
Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0) Opera 5.11 [en]
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040612 Firefox/0.8
Mozilla/5.0 (compatible; Konqueror/3.2; Linux) (KHTML, like Gecko)
Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6h
Each string is built from a sequence of product/version pairs and
comment elements, though in slightly different formats. The following
POSIX regular expression can parse each block:
^([^/[:space:]]*)(/([^[:space:]]*))?([[:space:]]*\[[a-zA-Z][a-zA-Z]\])?
[[:space:]]*(\((([^()]|(\([^()]*\)))*)\))?[[:space:]]*
Notice that this regular expression is less restrictive than the rules
defined by the RFCs, in order to cope better with non-standard
strings. The pattern tries to match a product["/" product-version]
followed by a comment. Both are optional, so the expression can match
only a product[/version], only a comment, or both. Here is an
explanation of each fragment of the regular expression:
^([^/[:space:]]*)
Product token (the user agent's name): any sequence of characters
except / and white space. It can be an empty string, to handle a
comment that stands alone, as in:
Mozilla/5.0 (compatible; Konqueror/3.2; Linux) (KHTML, like Gecko)
The second parenthesized block is matched as a single comment.

(/([^[:space:]]*))?
Optional version of the user agent, following the / after the product
token.

([[:space:]]*\[[a-zA-Z][a-zA-Z]\])?
Some old Netscape versions put a language code in brackets here; it is
skipped as non-standard. For example:
Mozilla/4.7 [en] (WinNT; U)

[[:space:]]*
Eats the spaces between the agent's name/version and the comment.

(\((([^()]|(\([^()]*\)))*)\))?
Optional comment within parentheses. One level of nested parentheses
is allowed inside the comment, as in:
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.11)
If Perl-compatible regular expressions (PCRE) are used, the following
pattern, written in extended (x) mode so that white space is ignored,
recursively matches arbitrarily deep levels of ( ):
\( ( ( (?>[^()]+) | (?R) )* ) \)

[[:space:]]*
Eats trailing spaces.
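The recursive matching of nested comments mentioned above can be tried out with PHP's preg_match. The sketch below is only one possible way to write it: the helper name leading_comment is invented for this example, and the pattern uses a recursive call to group 1, written (?1), instead of (?R), so that the comment body can be captured without its outer parentheses.

```php
<?php
// Hypothetical helper (the name is an assumption): returns the body of the
// comment at the start of $s, without its outer parentheses, or null when
// $s does not start with a balanced parenthesized comment.
function leading_comment(string $s): ?string
{
    // Group 1 is the comment body: runs of non-parenthesis characters, or a
    // parenthesized group whose body matches group 1 again through (?1).
    $pattern = '/^\(((?:[^()]++|\((?1)\))*)\)/';
    return preg_match($pattern, $s, $m) === 1 ? $m[1] : null;
}
```

For example, leading_comment('(a (b (c)) d) rest') yields the body a (b (c)) d, while a string with unbalanced parentheses yields null.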
The regular expression is applied to the User-Agent string repeatedly,
until the end of the string is reached; each pass extracts one
product/comment block from the string. For example, if it is applied
to
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
it matches these:
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.7.10)
Gecko/20050716
Firefox/1.0.6
Here is an excerpt of PHP code that implements the
parsing described above:
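function extract_products_from_agent_string($agent)
{
    $found = array();
    // ereg() was removed in PHP 7, so preg_match() is used with the same pattern.
    // PCRE is greedy rather than leftmost-longest like POSIX, so "(" is excluded
    // from the product token to let a stand-alone comment match as a comment.
    $pattern = '#^([^/[:space:](]*)' . '(/([^[:space:]]*))?';
    $pattern .= '([[:space:]]*\[[a-zA-Z][a-zA-Z]\])?' . '[[:space:]]*';
    $pattern .= '(\((([^()]|(\([^()]*\)))*)\))?' . '[[:space:]]*#';
    while (strlen($agent) > 0)
    {
        // an empty match would never consume input, so it is treated as no match
        if (preg_match($pattern, $agent, $a) && strlen($a[0]) > 0)
        {
            // trailing groups that did not take part in the match are missing from $a
            array_push($found, array("product" => $a[1],
                "version" => isset($a[3]) ? $a[3] : "",
                "comment" => isset($a[6]) ? $a[6] : ""));
            $agent = substr($agent, strlen($a[0]));
        }
        else $agent = ""; // abort parsing, no match
    }
    return $found;
}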
The function parses the User-Agent string passed as its argument and
returns an array of associative arrays, each one describing a
product/comment block in the agent string (the keys of each
associative array are product, version and comment).
4 Identify the product
Once the User-Agent string is parsed and we have the list of
products/comments, more problems arise.
Very few agents identify themselves with something like
Real_product_name/Version; most, for historical compatibility reasons,
begin their string by declaring themselves as “Mozilla/x.y”.
Some well-behaved agents declare their real identity as the second (or
third) product in the agent string, so it is enough to check for them.
Others place the real product name in the comment of the first
product, as Internet Explorer and Konqueror do, for example:
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Mozilla/4.0 (compatible; MSIE 6.0; MSN 2.5; Windows 98)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
Mozilla/5.0 (compatible; Konqueror/3.1; Linux)
Mozilla/5.0 (compatible; Konqueror/3.2; Linux) (KHTML, like Gecko)
As a general rule, if the first product is Mozilla/x.y and the first
element of the ;-separated list in its comment is compatible, then the
real name and version of the product are in the second element of the
list. Some browsers cloak themselves even deeper; look at this subtle
string:
Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0) Opera 5.11 [en]
This version of Opera declares itself as Mozilla, then as Explorer,
and only last as itself (using a non-standard way to declare the
version)! And how can Mozilla itself be recognized, if everybody
claims to be it? Not to mention that even Netscape (whose early
versions were internally codenamed Mozilla) declares itself as
Mozilla, and only in its latest versions adds its real name as the
last entry in the list of products...
The following algorithm can be used to identify most user agents
successfully. Tested against the lists of User-Agent strings from the
first two links given at the beginning of this document, it correctly
identified all user agents and their versions.
products = extract_products_from_agent_string(agent_string);
/* if a product in the list matches one of those that declare themselves correctly, return it */
if (is_in_list_one_of(FIREFOX, NETSCAPE, SAFARI, CAMINO, MOSAIC, OPERA, GALEON))
{
    product = the_product_matching;
    version = version_component_of_product_token_if_any;
    /* if Opera uses its non-standard format to declare the version, match it */
    if (product == OPERA) verify_if_string_match_pattern_like("Opera 5.11 [en]");
}
/* handle browsers declaring 'Mozilla compatible' */
else if (first_product_in_list == MOZILLA and comment_begin_with("compatible;"))
{
    check_for_cloaked_products(AVANT_BROWSER, CRAZY_BROWSER);
    product = second_entry_in_comment;
    version = second_entry_in_comment_after_space_or_/;
}
/* handle the real Mozilla (or old Netscape if version < 5.0) */
else if (first_product_in_list == MOZILLA)
{
    if (product_version < 5.0)
    {
        product = NETSCAPE;
        version = version_component_of_product_token_if_any;
    }
    else
    {
        product = MOZILLA;
        version = get_mozilla_version_from_comment;
    }
}
/* if none of the above matches, use the first product token in the list */
else
{
    product = first_product_in_list;
    version = version_component_of_product_token_if_any;
}
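The pseudocode above could be turned into PHP roughly as follows. This is a sketch under several assumptions, not the script described in this document: the function name identify_product is invented here, the input is assumed to be a list of arrays with product, version and comment keys (the shape returned by the parser), the Opera form "Opera 5.11 [en]" is assumed to arrive as an Opera token followed by a numeric token, and the check for cloaked products (Avant Browser, Crazy Browser) is left out.

```php
<?php
// Sketch only: returns array(product, version) for a parsed User-Agent string.
function identify_product(array $products): array
{
    // Agents assumed to declare themselves correctly (list from the pseudocode)
    $known = array('firefox', 'netscape', 'safari', 'camino', 'mosaic', 'opera', 'galeon');

    foreach ($products as $i => $p) {
        if (in_array(strtolower($p['product']), $known, true)) {
            $version = $p['version'];
            // Old Opera writes "Opera 5.11 [en]": the version is the next token
            if ($version === '' && isset($products[$i + 1])
                && preg_match('/^\d+(\.\d+)*$/', $products[$i + 1]['product'])) {
                $version = $products[$i + 1]['product'];
            }
            return array($p['product'], $version);
        }
    }

    $first = $products[0] ?? array('product' => '', 'version' => '', 'comment' => '');
    $entries = array_map('trim', explode(';', $first['comment']));

    if (strtolower($first['product']) === 'mozilla') {
        if (strtolower($entries[0]) === 'compatible' && isset($entries[1])) {
            // "MSIE 5.5" or "Konqueror/3.1": split name and version on space or slash
            $parts = preg_split('#[ /]#', $entries[1], 2);
            return array($parts[0], $parts[1] ?? '');
        }
        if ((float) $first['version'] < 5.0) {
            return array('Netscape', $first['version']);
        }
        // Real Mozilla: the version is the "rv:" entry of the comment, if present
        foreach ($entries as $entry) {
            if (strpos($entry, 'rv:') === 0) {
                return array('Mozilla', substr($entry, 3));
            }
        }
        return array('Mozilla', '');
    }
    // Nothing matched: fall back to the first product token
    return array($first['product'], $first['version']);
}
```

The order of the tests matters: the self-declaring products are checked first, so that a string such as Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0) Opera 5.11 [en] is reported as Opera rather than as MSIE.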
5 Identify the version
Well-behaved user agents report the version in their product token,
following the RFC definition product = token ["/" product-version].
Those that declare themselves as Mozilla/compatible usually report
their version after the product name, in the second entry of the
;-separated list in the comment.
In Mozilla the version is reported in the comment string, as the last
entry in the list, beginning with “rv”.
See here for more details on Mozilla's comment format:
http://www.mozilla.org/build/revised-user-agent-strings.html.
Others report the version in a non-standard way, like
“Opera 5.11 [en]”.
Sometimes, however, the version is omitted completely. The algorithm
above can correctly identify the version of the most common user
agents.
Here are some examples of where the version appears:
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Mozilla/5.0 (compatible; Konqueror/3.1; Linux 2.4.22-10mdk; X11; i686; fr, fr_FR)
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.0.1) Gecko/20020920 Netscape/7.0
Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.03 [en]
Lynx/2.8.4rel.1 libwww-FM/2.14
Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)
6 Identify the Operating System
The place to look for the operating system of the user agent is the
first comment in the User-Agent string. Mozilla defines a well-known
format (see the link above), in which the name of the operating system
is the 3rd element of the ;-separated list. For compatible agents, the
place to look is also the 3rd element, with some exceptions:
Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90; MSN 6.0)
There “Win 9x 4.90” stands for Windows Me... but Windows 98 is
declared as well! As the position of the real operating system is not
100% guaranteed, the following algorithm can be used to avoid
problems:
os_list = empty;
for each element in first_comment separated by ";"
{
    if (element begins_with("win") or
        element contains("linux") or
        element contains("Macintosh", "Mac OS X") or
        element contains("FreeBSD") or
        element contains("NetBSD") or
        element contains("OpenBSD") or
        element contains("SunOS") or
        element contains("Amiga") or
        element contains("BeOS") or
        element contains("IRIX") or
        element contains("OS/2", "Warp"))
    {
        add element to os_list;
    }
}
if (count(os_list) > 1)
    /* for Windows, exclude the bare "windows" entry; if "Win 9x 4.90" is present, return it */
    os = get_relevant_element(os_list);
else
    os = os_list[0];
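A possible PHP rendering of this loop is sketched below. The function name identify_os is invented for the example, and the way ties are resolved is an interpretation of the note in the pseudocode (prefer "Win 9x 4.90" when present, otherwise skip a bare "Windows" entry), not a rule confirmed elsewhere in this document. The argument is the first comment of the User-Agent string, without its parentheses.

```php
<?php
// Sketch only: picks the operating system entry from the first comment.
function identify_os(string $comment): string
{
    $needles = array('linux', 'macintosh', 'mac os x', 'freebsd', 'netbsd', 'openbsd',
                     'sunos', 'amiga', 'beos', 'irix', 'os/2', 'warp');
    $os_list = array();
    foreach (explode(';', $comment) as $element) {
        $element = trim($element);
        $lower = strtolower($element);
        // "win" must be at the start; the other names may appear anywhere
        $hit = (strpos($lower, 'win') === 0);
        foreach ($needles as $needle) {
            if (strpos($lower, $needle) !== false) {
                $hit = true;
                break;
            }
        }
        if ($hit) {
            $os_list[] = $element;
        }
    }
    if (count($os_list) > 1) {
        // Assumption: "Win 9x 4.90" (Windows Me) wins over any other entry
        foreach ($os_list as $os) {
            if (strcasecmp($os, 'Win 9x 4.90') === 0) {
                return $os;
            }
        }
        // Assumption: a bare "Windows" platform entry is less specific than the rest
        foreach ($os_list as $os) {
            if (strcasecmp($os, 'Windows') !== 0) {
                return $os;
            }
        }
    }
    return $os_list[0] ?? '';
}
```

An empty string is returned when no element of the comment looks like an operating system, as happens with most robots.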
7 Brute force approach
If nothing else works, or to verify that the code identifies user
agents correctly, it can be useful to keep a database of known
User-Agent strings that is updated periodically as new visits arrive.
The database stores the full string together with the real product,
version and operating system. To identify the agent, or to check the
result of the recognition code, just search for the string in the
database.
This approach is easier, as it does not need any complex code, but it
consumes more resources: it requires a database, storage for the
table's data, and processor time to execute the query.
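The lookup and the verification described above can be sketched in PHP with a plain array standing in for the database table. Everything in this sketch is illustrative: the two entries, the field names and the function names lookup_agent and find_mismatches are assumptions made for the example, and a real setup would query an actual database instead.

```php
<?php
// Stand-in for the database table: records keyed by the full User-Agent string.
$known_agents = array(
    'Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6h' =>
        array('product' => 'Lynx', 'version' => '2.8.4rel.1', 'os' => ''),
    'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)' =>
        array('product' => 'MSIE', 'version' => '5.5', 'os' => 'Windows NT 5.0'),
);

// Returns the stored record for an exact string match, or null if unknown.
function lookup_agent(array $table, string $agent): ?array
{
    return $table[$agent] ?? null;
}

// Runs a recognition function over every known string and returns the
// strings for which its result differs from the stored record.
function find_mismatches(array $table, callable $identify): array
{
    $mismatches = array();
    foreach ($table as $agent => $expected) {
        if ($identify($agent) != $expected) {
            $mismatches[] = $agent;
        }
    }
    return $mismatches;
}
```

Passing the recognition code to find_mismatches as a callable gives the list of strings it still gets wrong, which can then be inspected by hand.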
A very complete list of known User-Agent strings with product name and
version can be found on the PGTS site (http://www.pgts.com.au/) here:
http://www.pgts.com.au/download/data/browser_list.txt
8 PHP script
The rules described in this article have been used to develop a PHP
script that can identify most User-Agent strings. Tested against the
huge list (more than 13,000 entries) of User-Agent strings found at
http://www.pgts.com.au/download/data/browser_list.txt,
the script correctly identified most agents, reporting the correct
product, version and operating system; only the most bogus strings
could not be matched.
Version: 1.0 Created: 2005-08-27 Modified: 2005-09-15
© Copyright 2005 texSoft.it
This document is distributed under the
GNU Free Documentation License
This document is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
All trademarks in the document belong to their respective owners.