Networking Forums

Blocking WebWise (Phorm) by User-Agent

 
 
Chris Hills
Guest
Posts: n/a

 
      04-23-2009, 12:09 PM
Hi

I operate a few websites and in addition to having my domains
blacklisted by Phorm I want to exclude them using robots.txt just in
case. However, the BT WebWise guide at [1] says that their crawler obeys
entries for Yahoo's and Google's crawlers as well as "*", but does not
list their own crawler's user agent. This means that in order to block
their crawler you would have to block Google, Yahoo or both. Of
course I do not want to do that. Does anyone know what user-agent they
use? I made an inquiry using the form on the WebWise site but they refused
to answer my question as I am not a BT customer. Alternatively, one
could redirect the crawler to a different robots.txt file, but to do
this one would need to know the IP address(es) from which the crawler
operates.

Regards,

Chris Hills

[1]
www2.bt.com/static/i/btretail/webwise/help.html#how-do-i-prevent-webwise-from-scanning-my-site
 
 
 
 
 
Richard Tobin
Guest
Posts: n/a

 
      04-23-2009, 02:29 PM
In article <gsplqc$lrd$(E-Mail Removed)>,
Chris Hills <(E-Mail Removed)> wrote:

>I operate a few websites and in addition to having my domains
>blacklisted by Phorm I want to exclude them using robots.txt just in
>case. However, the BT WebWise guide at [1] says that their crawler obeys
>entries for Yahoo's and Google's crawlers as well as "*", but do not
>list their own crawler user agent. This means that in order to block
>their crawler you would have to either block Google, Yahoo or both. Of
>course I do not want to do that. Does anyone know what user-agent they
>use? I made inquiry using the form on the webwise site but they refuse
>to answer my question as I am not a BT customer.


I also asked them about this, and have received only an automated
acknowledgment.

>Alternatively, one
>could redirect the crawler to a different robots.txt file, but to do
>this one would need to know the ip address(es) from which the crawler
>operates.


I did find a list of addresses to block, but I don't remember where.
It's somewhere on the web.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
 
 
Dave Saville
Guest
Posts: n/a

 
      04-24-2009, 07:29 AM
On Thu, 23 Apr 2009 12:09:48 UTC, Chris Hills <(E-Mail Removed)> wrote:

> Hi
>
> I operate a few websites and in addition to having my domains
> blacklisted by Phorm I want to exclude them using robots.txt just in
> case. However, the BT WebWise guide at [1] says that their crawler obeys
> entries for Yahoo's and Google's crawlers as well as "*", but do not
> list their own crawler user agent. This means that in order to block
> their crawler you would have to either block Google, Yahoo or both. Of
> course I do not want to do that. Does anyone know what user-agent they
> use? I made inquiry using the form on the webwise site but they refuse
> to answer my question as I am not a BT customer. Alternatively, one
> could redirect the crawler to a different robots.txt file, but to do
> this one would need to know the ip address(es) from which the crawler
> operates.


Surely *something* is going to show up in your web logs? Tip: When
processing log files, exclude what you know you don't want. Then
whatever is left is out of the ordinary. You can't program for what
you don't know is there (yet) :-)
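The "exclude what you know" tip can be sketched in a few lines of Python. This is only a rough illustration: it assumes the server writes Combined Log Format (user agent as the last quoted field), and the list of known agents and the name unknown_agents are made up for the example.

```python
import re

# Assumes Combined Log Format, where the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Hypothetical substrings of user agents you already recognise; extend as needed.
KNOWN_AGENTS = ("Firefox", "MSIE", "Googlebot", "Slurp")


def unknown_agents(lines, known=KNOWN_AGENTS):
    """Return a dict mapping each unrecognised user agent to its hit count."""
    counts = {}
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line is not in the assumed log format
        agent = match.group(1)
        if any(k.lower() in agent.lower() for k in known):
            continue  # already known, so not interesting
        counts[agent] = counts.get(agent, 0) + 1
    return counts
```

Feeding it the lines of an access log leaves only the agents you have not yet accounted for, which is the "out of the ordinary" remainder. An agent that copies a browser's string would of course slip through.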

--
Regards
Dave Saville

NB Remove nospam. for good email address
 
 
Chris Hills
Guest
Posts: n/a

 
      04-24-2009, 01:07 PM
On 24/04/09 09:29, Dave Saville wrote:
> Surely *something* is going to show up in your web logs? Tip: When
> processing log files, exclude what you know you don't want. Then
> whatever is left is out of the ordinary. You can't program for what
> you don't know is there (yet) :-)


Dave

When my sites get accessed the user agents are indeed logged. However,
once I know what the crawler agent is, it will be too late since it will
already have been crawled :-)

Regards,

Chris
 
 
Invalid
Guest
Posts: n/a

 
      04-24-2009, 04:05 PM
In message <gssdi0$sit$(E-Mail Removed)>, Chris Hills
<(E-Mail Removed)> writes
>On 24/04/09 09:29, Dave Saville wrote:
>> Surely *something* is going to show up in your web logs? Tip: When
>> processing log files, exclude what you know you don't want. Then
>> whatever is left is out of the ordinary. You can't program for what
>> you don't know is there (yet) :-)

>
>Dave
>
>When my sites get accessed the user agents are indeed logged. However,
>once I know what the crawler agent is, it will be too late since it
>will already have been crawled :-)
>
>Regards,
>
>Chris

Does Phorm crawl the sites? Will the traffic they profile show up in the
logs in any way which differs from the original requester?

AIUI Phorm's methodology is to look at the web pages individuals are
browsing in order to profile the individual not the website. It
identifies the individual from cookies set on the user's machine, and
then passes on the original request to the website as if it came from
the individual.

Phorm say (see http://www.cl.cam.ac.uk/~rnc1/080518-phorm.pdf) that when
the website is first visited (by any ISP customer) the robots.txt file
is retrieved and cached (for a month). That implies you might see in the
log one request for robots.txt once a month from a Phorm IP. The rest of
the traffic from your website that they profile will be unidentifiable
by you in any way.
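If that is right, one thing to look for in the logs is a client that asks for robots.txt and nothing else. A rough Python sketch, assuming Common or Combined Log Format (client IP first, request line in quotes); the function name is made up, and any hit is only a candidate, since other tools fetch robots.txt on its own too:

```python
import re
from collections import defaultdict

# Assumes Common/Combined Log Format: client IP first, request line in quotes.
REQUEST_PATTERN = re.compile(r'^(\S+) .*?"(?:GET|HEAD) (\S+)')


def robots_only_clients(lines):
    """Return the sorted IPs whose every logged request was for /robots.txt."""
    paths_by_ip = defaultdict(set)
    for line in lines:
        match = REQUEST_PATTERN.match(line)
        if match:
            ip, path = match.groups()
            paths_by_ip[ip].add(path.lower())
    return sorted(ip for ip, paths in paths_by_ip.items()
                  if paths == {"/robots.txt"})
```

Run over a month or more of logs, this gives a short list of addresses worth looking up by hand.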

The same document also suggests that they will only respect a
User-Agent: * construction and not one targeted at their Agent. See Para
44 "we work on the basis that if a site allows spidering of its
contents by search engines, then its material is being openly published.
Conversely, if the site has disallowed spidering and indexing by search
engines, we respect those restrictions in robots.txt". If they aren't
going to respect a User-Agent: Phorm (and why would they) then there is
no real point in knowing what the agent is really called anyway.

I suspect there is going to be no way a website can block Phorm while
allowing Google etc. to index it without someone resorting to the
courts.
--
Invalid
 
 
Denis McMahon
Guest
Posts: n/a

 
      04-26-2009, 07:33 AM
On Apr 23, 1:09 pm, Chris Hills <c...@chaz6.com> wrote:

> I operate a few websites and in addition to having my domains
> blacklisted by Phorm I want to exclude them using robots.txt just in
> case.


Surely the best solution is to use the apache configs or .htaccess
files to deny the ip ranges involved at every level of the websites
concerned?
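The same check can be expressed in application code. Below is a minimal Python sketch using the standard ipaddress module. The ranges are placeholders taken from the reserved documentation blocks, because nobody in the thread has identified the real addresses, and is_blocked is a made-up name.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks. The real
# addresses are not known from this thread, so substitute your own list.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25")
]


def is_blocked(client_ip, ranges=BLOCKED_RANGES):
    """Return True if client_ip falls inside any of the blocked networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ranges)
```

A request handler would call is_blocked() with the remote address and refuse the request when it returns True. Doing it in the server config or firewall is usually cheaper, but the matching logic is the same.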

Remember that for any crawler, robots.txt is a matter of good manners
and etiquette, not a rule it is forced to obey.

Denis
 
 
Richard Tobin
Guest
Posts: n/a

 
      04-27-2009, 12:41 PM
In article <(E-Mail Removed) >,
Alex Fraser <(E-Mail Removed)> wrote:

>More simply, you can allow specific crawlers but disallow all others
>(including Webwise/Phorm), eg:
>
>User-agent: Google
>User-agent: Yahoo
>(etc)
>Disallow:
>
>User-agent: *
>Disallow: /
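For reference, this is how a standard robots.txt parser reads a file of that shape. The Python sketch below uses urllib.robotparser from the standard library. It assumes Googlebot and Slurp as the tokens for Google's and Yahoo's crawlers, and it says nothing about what WebWise/Phorm itself does with the file.

```python
from urllib.robotparser import RobotFileParser

# Same shape as the quoted example, with assumed crawler tokens filled in.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent, path="/", robots_txt=ROBOTS_TXT):
    """Return True if the given agent may fetch path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

With this file, allowed("Googlebot") and allowed("Slurp") are True, while any agent not named falls through to the "*" group and is refused.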


The BT website implies that if you allow Google or Yahoo, Phorm
will take that as permission.

If you know their IP addresses, you can block them at your firewall.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
 
 
Richard Tobin
Guest
Posts: n/a

 
      04-27-2009, 10:21 PM
In article <(E-Mail Removed) >,
Alex Fraser <(E-Mail Removed)> wrote:

>> If you know their IP addresses, you can block them at your firewall.


>If you block them at your firewall, you may find them treating inability
>to fetch the robots.txt file the same as getting a 404 or a file with no
>instructions, ie they will inspect content.


I was planning to block the whole site to them, not just robots.txt.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
 
 
Richard Tobin
Guest
Posts: n/a

 
      04-28-2009, 10:05 AM
In article <_eGdnSRcoqtCt2vUnZ2dnUVZ8o-(E-Mail Removed)>,
Alex Fraser <(E-Mail Removed)> wrote:

>I should have read the next question, which clearly says that if any of
>Google, Yahoo or "*" are disallowed, Phorm will not inspect the URL.
>Note that this contradicts what you say is implied.


I must have misread that. But I'm not convinced that it isn't some
weasel wording: "disallows any of these" could mean "doesn't allow
any", i.e. "disallows all". There's no reason to assume good faith on
their part. But in any case I don't want to disallow anything except
Phorm.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
 
 
Invalid
Guest
Posts: n/a

 
      04-28-2009, 10:39 AM
In message <gt5b4i$2nk5$(E-Mail Removed)>, Richard Tobin
<(E-Mail Removed)> writes
>In article <(E-Mail Removed) >,
>Alex Fraser <(E-Mail Removed)> wrote:
>
>>> If you know their IP addresses, you can block them at your firewall.

>
>>If you block them at your firewall, you may find them treating inability
>>to fetch the robots.txt file the same as getting a 404 or a file with no
>>instructions, ie they will inspect content.

>
>I was planning to block the whole site to them, not just robots.txt.
>
>-- Richard

I don't think there is any way you can.

As I said in an earlier post I don't think that Phorm actually fetches
the pages. The pages that they inspect are those fetched by the real
user - Phorm simply profiles a copy taken as they go back to the user
through the inspection routers at the user's ISP.

The robots.txt handling is a mechanism they use to identify URLs
requested by the user that don't allow them to profile the pages. The
first time they see a URL (from any ISP's user?) they check robots.txt
to see if the site allows Google, Yahoo or any spiders to view it. If so
they cache that fact and subsequently transparently profile any pages
from that URL as they go past.
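To make that reading concrete, here is a toy Python model of the decision as described in this thread. It is not Phorm's code, and every detail is an assumption: the crawler tokens (Googlebot, Slurp), the made-up generic agent standing in for "*", the reading that a disallow for any one of them stops profiling, the site-wide check of "/" only, and the month-long cache mentioned earlier, approximated as 30 days.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumed tokens for Google and Yahoo, plus a made-up generic agent that
# can only match the "*" group.
CHECKED_AGENTS = ("Googlebot", "Slurp", "ExampleGenericBot")
CACHE_SECONDS = 30 * 24 * 3600  # "a month", approximated as 30 days


def may_profile(robots_txt):
    """True only if none of the checked agents is disallowed from '/'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(agent, "/") for agent in CHECKED_AGENTS)


class DecisionCache:
    """Remember the per-site decision until it expires."""

    def __init__(self, fetch_robots, clock=time.time):
        self._fetch = fetch_robots  # hypothetical callable: host -> robots.txt text
        self._clock = clock
        self._entries = {}          # host -> (expiry time, decision)

    def may_profile(self, host):
        now = self._clock()
        cached = self._entries.get(host)
        if cached and cached[0] > now:
            return cached[1]
        decision = may_profile(self._fetch(host))
        self._entries[host] = (now + CACHE_SECONDS, decision)
        return decision
```

Under this model a site with no robots.txt rules is profiled, and a site that disallows even one of the three agents is not. Whether the real system reads "any" that way is exactly the point in dispute above.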

The IP address of the requests will be the IP address of the original
requestor. Pages being profiled by Phorm will (probably) be
indistinguishable from any other requests made to the site.

The only way to block Phorm profiling will be to disallow all indexing
of the site, disallow all users from ISPs that use the system (whether
or not the user has opted out - you can't tell) or to trust Phorm not to
profile your site after you have told them not to by e-mail.
--
Invalid
 
 
 
 