MyBB Community Forums

Full Version: Bot/Crawler problem
Does anyone else see these IPs on their site? These bots are always on mine...

Guest IP: 38.99.44.103, 38.99.13.125

ISP of this IP: Performance Systems International
Organization: Performance Systems International
Host of this IP: crawl-12.cuill.com

So now I know that it's Cuil. I tried adding them in ACP > Bots, but
they never show up as Cuil. Does anyone know what their user agent string is?
Thanks
Define "cuil". Also, MyBB doesn't have a function to find bots' user agents.
Cuil (pronounced "Cool") is a "new" search engine that was created by ex-Google employees. See http://www.cuil.com/

More information on the Cuil spider (known as Twiceler) can be found at http://www.cuil.com/info/webmaster_info/

The user agent is defined as:

Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
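
To show why this string is enough to make the bot show up by name, here is a minimal sketch. It assumes bot detection is a case-insensitive substring match against the user agent. That is an assumption for illustration, not MyBB's actual code, and `KNOWN_BOTS` and `identify_bot` are hypothetical names.

```python
# The user agent string quoted above.
TWICELER_UA = "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"

# Hypothetical bot table: display name -> substring to look for.
# This only mimics the idea of a "bots" list; it is not MyBB's data structure.
KNOWN_BOTS = {"Cuil": "twiceler"}


def identify_bot(user_agent):
    """Return the display name of the first matching bot, or None."""
    ua = user_agent.lower()
    for name, needle in KNOWN_BOTS.items():
        if needle in ua:
            return name
    return None


print(identify_bot(TWICELER_UA))
```

Under this assumption, a short distinctive fragment such as "twiceler" should be a more robust match than the full string, since the version number in it can change.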
perfect..that worked...
thanks
Now I notice that Cuil is always online and is not listed as a guest.
Quote: 2 users active in the past 15 minutes (1 member, 0 of whom are invisible, and 0 guests).
Cuil, continuum
I tried to prevent it from crawling by adding a disallow rule for Cuil in robots.txt and refreshed everything in the cache manager... but this thing is always online...

Any thoughts?
If you click details, maybe you can see what Cuil is doing?

Maybe it's reading your calendar week by week from 1970 until 2050...
(2008-12-18, 04:52 PM) frostschutz wrote: If you click details, maybe you can see what Cuil is doing?

Maybe it's reading your calendar week by week from 1970 until 2050...

I can understand bots visiting once in a while... but Cuil is always present.
Because you probably created a new site, it's visiting frequently to index it as fast as it can (it has to live up to what they promise). However, if you denied them in your robots.txt file, give it 24 to 48 hours to actually take effect. Make sure you denied Twiceler, as that's their bot.
Best Regards.
Funnily enough, Twiceler is one of the most annoying bots. Several friends with IP.Boards have had their Cuil spider online for weeks, and all it's doing is constantly reviewing threads, new information and eating bandwidth.

Basically it finds dynamic urls, then starts splicing them apart like there's no tomorrow. For example, domain.com/article/12/2008/i-want-rid-of-this-thing/. Twiceler will split the URL apart, looking at anything in the 2008 directory, then the 12 directory, then the article directory, then the root directory. If you have a busy site the amount of information it collects is unreal. So indexing your calendar probably isn't too far from the truth...
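
A rough sketch of the URL walking described above, for illustration only. The function name `parent_paths` is hypothetical, an `http://` scheme is added to the example URL so it parses, and nothing here is taken from Twiceler itself.

```python
from urllib.parse import urlsplit


def parent_paths(url):
    """List every parent directory of a URL's path, deepest first."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    paths = []
    # Drop one trailing segment at a time until only the root is left.
    for depth in range(len(segments) - 1, -1, -1):
        prefix = "/".join(segments[:depth])
        paths.append("/" + prefix + "/" if prefix else "/")
    return paths


for path in parent_paths("http://domain.com/article/12/2008/i-want-rid-of-this-thing/"):
    print(path)
# prints:
# /article/12/2008/
# /article/12/
# /article/
# /
```

One deep URL yields several extra directories to crawl, which is why a busy site with many dynamic URLs would see so much traffic.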

Twiceler will obey robots.txt, but you need to name it explicitly, and even then the creators admit it will take 7 days for it to take any notice of what you tell it to do.

User-agent: twiceler
Disallow: /
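
A quick way to sanity-check a rule like this before waiting a week is Python's standard `urllib.robotparser`. This is only a sketch: it shows how that parser reads the rule, not how Twiceler itself interprets it, and the helper name `allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as in the robots.txt snippet above.
RULES = """\
User-agent: twiceler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path="/"):
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


print(allowed("Twiceler"))   # False: the rule blocks the whole site
print(allowed("Googlebot"))  # True: no rule applies to other bots
```

The agent name is compared case-insensitively here, but it does have to be spelled correctly, so a typo in the `User-agent` line would leave the bot unblocked.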
(2008-12-18, 10:49 PM) neoflight wrote: I can understand bots visiting once in awhile...but cuil is always present.

...right. And if you clicked Details or Complete List or whatever under 'who's online' you'd at least get a general idea of what it is reading / looking at. Not sure if there is a way to trace a user / bot in more detail. Drupal has such a tracing feature so they can ban people / bots / guests who put too much load on the site (sometimes some user thinks it's a great idea to download a whole forum).
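
One way to trace a bot in more detail, outside the forum software, is to tally what it requests in the web server's access log. The sketch below assumes the common "combined" log format. The sample lines are invented for illustration (only the IPs and the user agent come from this thread), and `paths_requested_by` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the request, status, size, referer and user agent fields of a
# combined-format log line (an assumption about the server's log layout).
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

UA = "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"

# Invented sample data, not real log entries.
SAMPLE = [
    '38.99.44.103 - - [18/Dec/2008:16:52:01 +0000] "GET /calendar.php?week=1 HTTP/1.1" 200 5120 "-" "%s"' % UA,
    '38.99.13.125 - - [18/Dec/2008:16:52:09 +0000] "GET /member.php?uid=2 HTTP/1.1" 200 4096 "-" "%s"' % UA,
    '38.99.44.103 - - [18/Dec/2008:16:52:15 +0000] "GET /calendar.php?week=1 HTTP/1.1" 200 5120 "-" "%s"' % UA,
    '10.0.0.5 - - [18/Dec/2008:16:52:20 +0000] "GET /index.php HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux)"',
]


def paths_requested_by(lines, agent_substring):
    """Count requested paths for user agents containing `agent_substring`."""
    needle = agent_substring.lower()
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and needle in match.group("agent").lower():
            counts[match.group("path")] += 1
    return counts


for path, hits in paths_requested_by(SAMPLE, "twiceler").most_common():
    print(hits, path)
```

Run against a real log, the most-requested paths would show whether the bot is stuck in something like the calendar or the member profiles.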
(2008-12-19, 12:09 AM) Tom.M wrote: Funnily enough, Twiceler is one of the most annoying bots. Several friends with IP.Boards have had their Cuil spider online for weeks, and all it's doing is constantly reviewing threads, new information and eating bandwidth.

Basically it finds dynamic urls, then starts splicing them apart like there's no tomorrow. For example, domain.com/article/12/2008/i-want-rid-of-this-thing/. Twiceler will split the URL apart, looking at anything in the 2008 directory, then the 12 directory, then the article directory, then the root directory. If you have a busy site the amount of information it collects is unreal. So indexing your calendar probably isn't too far from the truth...

Twiceler will obey robots.txt, but you need to name it explicitly, and even then the creators admit it will take 7 days for it to take any notice of what you tell it to do.

User-agent: twiceler
Disallow: /

Got that one: http://caejournal.com/robots.txt
Yes, it reads all the pages and hangs out just reading people's profiles, sending things.
I think I will have to wait for it to obey my commands... strange, don't you think?