Regular expression example: IP location
On the previous page, we showed how a regular expression can be used to
extract the country code from the referrer string,
looking at the simplest case of a Yahoo referrer string.
Parsing the Google referrer string
Recall that the Google referrer strings look as follows:
http://www.google.com/search?hl=fr&q=dictionary+french
http://www.google.co.in/search?hl=en&q=java+programming
http://www.google.com.au/search?hl=en&q=sidney+shopping
http://www.google.bg/search?hl=bg&q=red+wine
As you can see:
- in many cases these referrer strings contain a language code
as a parameter which we could additionally use;
- even if the country-neutral google.com domain is used, a language
code can still be specified;
- in country-specific domains, before the two-letter country code suffix of the domain, there
can be an additional suffix (.com or .co).
We'll propose treating these referrers as two types of case:
- if the domain is google.com, we'll look for a language code (the hl parameter)
and use it as a clue to location;
- otherwise, we'll extract the two-letter country code from the end of the domain.
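The two cases above can be sketched in Java as follows. This is only a rough illustration under stated assumptions, not the code developed on the next page: the class name ReferrerLocation, the method locationClue and both patterns are hypothetical, and the sketch uses java.net.URI to split off the host and query string before applying the regular expressions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and patterns are illustrative assumptions.
final class ReferrerLocation {

    // Country-specific domain: "google", an optional ".com" or ".co",
    // then a two-letter country code at the very end of the host.
    private static final Pattern COUNTRY_DOMAIN =
            Pattern.compile("(?:^|\\.)google(?:\\.com?)?\\.([a-z]{2})$");

    // Two-letter language code in the hl parameter of the query string;
    // a variant such as hl=fr-CA still yields "fr".
    private static final Pattern HL_PARAM =
            Pattern.compile("(?:^|&)hl=([a-z]{2})(?:[&-]|$)");

    // Returns a two-letter clue to location, or empty if none is found.
    static Optional<String> locationClue(String referrer) {
        URI uri;
        try {
            uri = new URI(referrer);
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
        String host = uri.getHost();
        if (host == null) {
            return Optional.empty();
        }
        host = host.toLowerCase(Locale.ROOT);

        // Case 1: country-neutral google.com, so fall back on the language code.
        if (host.equals("google.com") || host.endsWith(".google.com")) {
            String query = uri.getRawQuery();
            if (query == null) {
                return Optional.empty();
            }
            Matcher m = HL_PARAM.matcher(query);
            return m.find() ? Optional.of(m.group(1)) : Optional.empty();
        }

        // Case 2: country-specific domain, so take the code from the end of the host.
        Matcher m = COUNTRY_DOMAIN.matcher(host);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] referrers = {
            "http://www.google.com/search?hl=fr&q=dictionary+french",
            "http://www.google.co.in/search?hl=en&q=java+programming",
            "http://www.google.com.au/search?hl=en&q=sidney+shopping",
            "http://www.google.bg/search?hl=bg&q=red+wine"
        };
        for (String r : referrers) {
            System.out.println(locationClue(r).orElse("?") + "  " + r);
        }
    }
}
```

Run against the four example referrers, this sketch gives the clues fr, in, au and bg respectively. Note that for the country-specific domains the hl parameter is deliberately ignored, in line with the second case above.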
Of course, this isn't perfect. For example, there are many Spanish speakers living in the
southern states of the US who may well use google.com but have the language configured to
be es. With our simplistic method here, we'd mistakenly say they were in Spain.
In the first URL above, we will say the user is in France on the basis of the language code fr,
but they could quite likely be in Canada.
And ultimately there's nothing to stop a user in Spanish-speaking Peru using
the Australian site google.com.au and configuring their language to be Italian.
Slightly erroneously, we're pretending that country and language codes are
the same thing; in some cases this isn't true, and in some cases a language can be specified
with a locational variant (e.g. fr-CA for Canadian French) which would be a better clue.
We'll ignore these issues here. In practice, the simplistic method we outline
here turns out to be a reasonable first approximation in many cases.
On the next page, we consider these two types of case in turn: google.com with a language code, and a country-specific Google domain.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.