  1. #1
    Super Moderator jscher2000
    (Oh, this is in response to Post #804497)

    Let's say hypothetically that we wanted to open up 2 forums a week to indexing. It would be easy if the Lounge's URLs actually contained the name of the individual forum, e.g.

    lounge.windowssecrets.com/Spreadsheets/
    lounge.windowssecrets.com/SuggestionBox/

    But alas, it is not so. Almost every URL consists of index.php followed by various parameters. This site seems to have thought about how to exclude robots from indexing certain links, and may provide a helpful model for a transition strategy: IPB Robots File Example. Google apparently lets you test the consequences of your robots.txt file: Block or remove pages using a robots.txt file - Webmasters/Site owners Help.

    Ultimately, we might need to use some very arbitrary approach, such as the following:

    Code:
    Disallow: /index.php?showtopic=2
    Disallow: /index.php?showtopic=3
    Disallow: /index.php?showtopic=4
    Disallow: /index.php?showtopic=5
    Disallow: /index.php?showtopic=6
    Disallow: /index.php?showtopic=7
    Disallow: /index.php?showtopic=8
    Disallow: /index.php?showtopic=9
    Thus, the robot would find and be free to index threads whose numbers start with 1 (1, 100-199, 100000-199999, etc.). At appropriate intervals, the disallow directives could be deleted to open up the next group of threads.

    Apologies for any syntax errors; I haven't tested the above myself.
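
    For anyone who wants to sanity-check plain prefix rules like these before uploading them, Python's standard-library urllib.robotparser will do it; a quick sketch (I've added a User-agent: * line so the fragment parses as a complete file, and note that this parser does not understand wildcard extensions, which these rules don't use anyway):

    Code:
    # Sanity check of the prefix rules above, using only the standard library.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /index.php?showtopic=2",
        "Disallow: /index.php?showtopic=3",
        "Disallow: /index.php?showtopic=4",
        "Disallow: /index.php?showtopic=5",
        "Disallow: /index.php?showtopic=6",
        "Disallow: /index.php?showtopic=7",
        "Disallow: /index.php?showtopic=8",
        "Disallow: /index.php?showtopic=9",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # A thread number starting with 2-9 is blocked...
    print(parser.can_fetch("*", "/index.php?showtopic=214000"))  # False
    # ...while a thread number starting with 1 stays crawlable.
    print(parser.can_fetch("*", "/index.php?showtopic=114000"))  # True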

  2. #2
    Brian Livingston
    Guest
    You are correct about the gradual nature of the change we need to make. Our basic thinking here in Seattle about revealing Lounge pages to search engines is that everything should be released over a period of approximately 30 days. Numerous articles on search-engine optimization indicate that it's a bad idea to reveal 100,000 new pages to Google all at once. The search giant's algorithms tend to ban sites that suddenly increase in size, assuming these sites to be spam. Discussions in SEO forums lead me to believe that it would be reasonable for us to reveal 5,000 pages per day to Google and other search engines, however.

    The following lines currently appear in the Lounge's Robots.txt file (all search engines are blocked, and the Googlebot specifically, because crawlers previously crashed the Lounge's underpowered server):

    Code:
    User-agent: *
    Disallow: /
    
    User-agent: Googlebot
    Disallow: /
    The Lounge has more than 700,000 posts in approximately 125,000 threads. At an average of about 6 posts per thread, revealing 20,000 posts per day to search engines should expose roughly 3,333 thread pages per day, unveiling the entire Lounge to crawlers in 30 days or so, by approximately Dec. 31, 2009.
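
    For anyone checking my math, the arithmetic is simply this (a rough sketch, nothing more):

    Code:
    # Rough arithmetic behind the figures above.
    total_posts = 700000
    total_threads = 125000
    posts_revealed_per_day = 20000

    posts_per_thread = total_posts / total_threads           # about 5.6, call it 6
    pages_per_day = posts_revealed_per_day / 6                # about 3,333 thread pages
    days_to_reveal = total_posts / posts_revealed_per_day     # 35 days, roughly a month

    print(round(posts_per_thread, 1), round(pages_per_day), days_to_reveal)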

    My research on Robots.txt suggests that the following pattern would "Disallow" indexing of all existing threads. We would then edit two lines each day to say "Allow" rather than "Disallow." We would start by revealing the newest content and move back in time to reveal older content. Search engines would see and index parts of the database gradually, as though the site were adding a few thousand pages each day, like Digg.com and other large sites. Specifically, we would change the 81* and 80* lines on Day 1, change the 79* and 78* lines on Day 2, and so forth.

    Code:
    User-agent: *
    Allow: /index.php?showtopic=99*
    Allow: /index.php?showtopic=98*
    ...
    Disallow: /index.php?showtopic=81*
    Disallow: /index.php?showtopic=80*
    Disallow: /index.php?showtopic=79*
    Disallow: /index.php?showtopic=78*
    Disallow: /index.php?showtopic=77*
    ...
    One disadvantage of this method is that changing the showtopic=81* line to "Allow" from "Disallow" would permit search engines to index not only threads with a starting post of 810000 to 819999 but also posts like 810, 8100, and so forth. Because not every post is the beginning of a thread, however, this small amount of imprecision shouldn't make any significant difference in the number of pages we reveal each day.

    The day after every "Disallow" line has been changed to "Allow," the "Allow" lines can all be deleted. The Robots.txt file can then return to a very simple state — perhaps excluding only our search-results page, image directories and the like.
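
    To keep the daily edits mechanical, the body of the file could even be generated by a small script; the sketch below assumes, purely for illustration, that the two-digit prefixes run from 99 down to 10 and that everything from 82* upward is open at launch:

    Code:
    # Sketch: generate the Allow/Disallow body for a given day of the rollout.
    # Day 0 is the launch state (99*-82* open); each later day opens two more
    # prefixes, e.g. 81* and 80* on Day 1, 79* and 78* on Day 2, and so on.

    def rollout_rules(day):
        lines = ["User-agent: *"]
        lowest_open = 82 - 2 * day             # lowest prefix that is open today
        for prefix in range(99, 9, -1):        # two-digit prefixes 99 down to 10
            directive = "Allow" if prefix >= lowest_open else "Disallow"
            lines.append("%s: /index.php?showtopic=%d*" % (directive, prefix))
        return "\n".join(lines)

    print(rollout_rules(1))   # 81* and 80* now allowed; 79* and below still blocked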

    The following Web pages explain how Google interprets the Robots.txt directives I've shown above, and how various search engines do and do not observe Robots.txt conventions. For example, some smaller search engines do not understand the asterisk operator except to indicate file types, such as *pdf$:

    http://www.google.co...n&answer=156449

    http://www.serbanghi...ent-issues.html

    One final complication is that our Web developer and Admin Kurt Naber has found that we can permanently get rid of the index.php?showtopic= part of Lounge URLs in the next few days. We will take this into consideration before altering the Robots.txt file. I'll write in a few days about our progress on making shorter URLs. Thanks.

  3. #3
    Brian Livingston
    Guest
    Admin Tony Johnston and I are just about ready to open the Lounge to search engine indexing on Dec. 5. Our goal is to see that all Lounge pages are in every major index by approximately Dec. 31.

    One problem with using robots.txt (the name must be lowercase, according to some sources, notwithstanding what I wrote previously) is that Invision Power Board does not use a directory structure as such. Instead, IPB uses query strings appended to index.php, the main script file. The link to our FAQ post on how to use Bro.ws links, for example, looks as follows (notice no subdirectories under index.php):

    Code:
    http://lounge.windowssecrets.com/index.php?showtopic=768802
    Usually, in a robots.txt file, you exclude directories that contain code, archives, images, and other files that would waste a search engine's time and your site's bandwidth to index. The following lines in robots.txt would accomplish such an exclusion:

    Code:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Originally, we had planned to make Google and other search engines index the Lounge's content over a period of 30 days by excluding 100 strings on the first day, 97 strings on the second day, and so forth. But not all search engines support wildcards in directory names in robots.txt.

    Instead, we're now planning to use the crawl-delay feature of robots.txt to limit search engines to indexing 1,440 pages per day. If this is successful, we'll reduce the delay so crawlers can index 2,880 pages per day, then 5,760 per day, and so forth. The issue is not bandwidth — when more than 100 members are signed in, the current server's CPU utilization is only 5% to 10% — but the fact that Google sometimes bans sites that suddenly reveal 100,000 new pages.
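
    The arithmetic behind those figures, for anyone following along (per crawler that honors the directive):

    Code:
    # Crawl-delay (seconds between requests) -> maximum pages per day.
    SECONDS_PER_DAY = 24 * 60 * 60             # 86,400

    for delay in (60, 30, 15):
        print(delay, "sec ->", SECONDS_PER_DAY // delay, "pages/day")
    # 60 sec -> 1440, 30 sec -> 2880, 15 sec -> 5760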

    The following text represents our latest draft of the robots.txt file that we plan to upload to the site. The lines beginning with hash marks (#) are comments. Please feel free to reply in this thread if you have any additional information to provide on robots.txt. Thanks.

    Code:
    # robots.txt file for lounge.windowssecrets.com - Draft 2009-12-04
    #
    # "Disallow" lines prevent indexing the following non-content sections:
    #
    # Showuser
    # Showforum (no useful content on forum pages)
    # Search entry and search results
    # Top posters
    # Sign in, Delete cookies (both use "section=login")
    # Register
    # Mark as read
    #
    # Not all robots honor "Allow" directives, but Google and some others do.
    # Crawl-delay: 60 seconds between page requests equals 1,440 pages per day.
    # We will reduce crawl-delay after we judge the impact of search engines.
    # Sources differ on whether Google honors "Crawl-delay"; we'll find out.
    
    User-agent: * 
    Allow: /
    Disallow: /index.php?showuser*
    Disallow: /index.php?showforum*
    Disallow: /index.php?app=core&module=search*
    Disallow: /index.php?app=members&section=view&module=list*
    Disallow: /index.php?app=core&module=global&section=login*
    Disallow: /index.php?app=core&module=global&section=register*
    Disallow: /index.php?app=forums&module=forums&section=markasread*
    Crawl-delay: 60
    
    User-agent: Googlebot
    Allow: /
    Disallow: /index.php?showuser*
    Disallow: /index.php?showforum*
    Disallow: /index.php?app=core&module=search*
    Disallow: /index.php?app=members&section=view&module=list*
    Disallow: /index.php?app=core&module=global&section=login*
    Disallow: /index.php?app=core&module=global&section=register*
    Disallow: /index.php?app=forums&module=forums&section=markasread*
    Crawl-delay: 60
    
    # Prevent Google from indexing images (until we decide otherwise)
    User-agent: Googlebot-Image
    Disallow: /
    
    # Permit AdSense to be served, if desired
    User-agent: Mediapartners-Google
    Allow: /
    
    # Sources:
    # http://www.google.com/support/webmas...&answer=156449
    # http://en.wikipedia.org/wiki/Robots.txt
    # http://www.serbanghita.com/search-en...nt-issues.html
    # http://www.mcanerin.com/EN/search-engine/robots-txt.asp

  4. #4
    Platinum Lounger
    Join Date
    Feb 2002
    Location
    A Magic Forest in Deepest, Darkest Kent
    Posts
    5,681
    Thanks
    0
    Thanked 1 Time in 1 Post
    Looks OK, Brian.

    One comment, though, that I am sure you have considered:

    /index.php?module=extras&do=leaders*

    This link is available even when not logged in, and it could open Moderator and Admin profiles to the robots. Whilst I am sure they are happy to have their information open to users, I am just making you aware.
    Jerry

  5. #5
    Platinum Lounger
    Join Date
    Feb 2002
    Location
    A Magic Forest in Deepest, Darkest Kent
    Posts
    5,681
    Thanks
    0
    Thanked 1 Time in 1 Post
    In addition to my suggestion above, one thing that may be useful for the WS Lounge is to have the bot visit the Active Posts page when it hits the site, so that it picks up the most recent activity at the time of the visit and Google displays more relevant information. Doing this, however, would conflict with one of the Disallow statements:

    Code:
    ...
    Disallow: /index.php?app=core&module=search*
    
    ...
    It could work with a paired Allow and Disallow statement:


    Code:
    Allow: /index.php?app=core&module=search&do=new_posts
    Disallow: /index.php?app=core&module=search*
    Your thoughts...
    Jerry

  6. #6
    Brian Livingston
    Guest
    Jezza, I think you have excellent suggestions, and we will add them to the robots.txt file before launching. As you can see, this file is complex and poorly supported, which is why it's taking us a bit longer than Dec. 3 to get it to the point where we can trust our algorithm.

    For example, "Allow:" is supported by Google but not every other search engine. I've having a bit of difficulty finding an authoritative source on exactly what commands are supported by which engines. Also, asterisk (*) works differently or doesn't work at all by some crawlers.

    Finally, I'm getting a bit concerned about Google not supporting "Crawl-delay" (according to some sources); Google's own documentation says nothing about supporting this directive. We may have to use 100 different lines (99*, 98*, etc.) in the Googlebot section. I'll let you know here how this turns out. Thanks for your help.

  7. #7
    Platinum Lounger
    Join Date
    Feb 2002
    Location
    A Magic Forest in Deepest, Darkest Kent
    Posts
    5,681
    Thanks
    0
    Thanked 1 Time in 1 Post
    Thanks Brian

    I have been doing my homework and at the moment I have a list of 303 robots. I think it may be more useful if we decide which bots you want to Allow, and then we can work on the various combinations. In some cases you could set robots.txt to let only certain types of bot in at particular times of the "server's day", and then at least we can see who they are.
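
    One rough way to do the time-of-day part would be to keep a couple of prepared variants of the file and copy the right one into place from a scheduled job; a sketch only, and the file names and paths here are made up:

    Code:
    # Sketch: swap robots.txt variants on a schedule (run hourly from cron).
    # The variant file names and document root below are hypothetical.
    import shutil
    from datetime import datetime

    def rotate_robots(docroot="/var/www/lounge"):
        hour = datetime.now().hour             # server's local time
        variant = "robots-day.txt" if 6 <= hour < 22 else "robots-night.txt"
        shutil.copyfile("%s/%s" % (docroot, variant), "%s/robots.txt" % docroot)

    rotate_robots()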
    Jerry

  8. #8
    Brian Livingston
    Guest
    The conclusion to our research seems to be that Google does not honor the crawl-delay directive in robots.txt files. Therefore, we apparently need to block Googlebot from specific ranges of threads in order to reveal only about 5,000 pages per day. We regard this as a safe rate of "new pages" that will not get the Lounge or WS banned as a "spammy" site.

    The newest thread, as I write this, is numbered around 770000. A directive such as:

    Code:
    Disallow: /index.php?showtopic=21*
    selects posts with numbers like: 210000, 21000, 2100, 210, and 21. The range 21*, therefore, theoretically includes 11,111 posts. Deleted and hidden posts, however, won't be indexed and won't be part of the count. Because there are an average of 6 posts per thread, 11,111 posts represent approximately 1,850 pages (ignoring the fact that a thread might be displayed across more than one page).
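
    The counting behind that estimate, as a quick sketch:

    Code:
    # How many post numbers a two-digit prefix such as 21* can cover.
    prefix_posts = sum(10 ** k for k in range(5))    # 21, 210-219, ..., 210000-219999
    print(prefix_posts)                              # 11,111 possible post numbers
    print(round(prefix_posts / 6))                   # ~1,852 thread pages at 6 posts/thread
    print(round(3 * prefix_posts / 6))               # ~5,556 pages when 3 such lines are removed in a day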

    On Day 1, we wish to allow Google to index our newest threads first, namely 760000 and higher, and block Google from indexing threads 759999 and below. To do this, I've found, only 12 lines of code must be added to the draft robots.txt file shown above in Comment #3:

    Code:
    Disallow: /index.php?showtopic=75*
    Disallow: /index.php?showtopic=74*
    Disallow: /index.php?showtopic=73*
    Disallow: /index.php?showtopic=72*
    Disallow: /index.php?showtopic=71*
    Disallow: /index.php?showtopic=70*
    Disallow: /index.php?showtopic=6*
    Disallow: /index.php?showtopic=5*
    Disallow: /index.php?showtopic=4*
    Disallow: /index.php?showtopic=3*
    Disallow: /index.php?showtopic=2*
    Disallow: /index.php?showtopic=1*
    We will then delete 3 "Disallow" lines each day (for example, 75*, 74*, and 73*) to reveal about 5,000 new pages to Google. When we get down to the single 6* line, we'll expand it into 69*, 68*, and so on, so that we can continue deleting 3 lines per day in that range.

    The robots.txt file itself has grown too long and wide (with comments added to several lines to document them) to paste in here. For this reason, I've made the current build of the file an attachment:

    [attachment=86969:robots.txt]

    Anyone can view the current state of robots.txt in real time as the Admins revise it over the next 30 days. To do this, enter into your address bar Lounge.WindowsSecrets.com plus /robots.txt. The file is readable by the public and must be readable for crawlers to see it. As you'll see, I like to use approximately one line of documentation for each line of code. In this case, the internal documentation may make our use of robots.txt more understandable for other webmasters who are researching their options.
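
    If you'd rather pull the file from a script than from the browser, a couple of lines will do it (a sketch; the file is plain text and publicly readable):

    Code:
    # Fetch and print the live robots.txt.
    import urllib.request

    with urllib.request.urlopen("http://lounge.windowssecrets.com/robots.txt") as response:
        print(response.read().decode("utf-8", errors="replace"))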

    Thanks to everyone for your support.

  9. #9
    Platinum Lounger
    Join Date
    Feb 2002
    Location
    A Magic Forest in Deepest, Darkest Kent
    Posts
    5,681
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Brian Livingston:
    The conclusion to our research seems to be that Google does not honor the crawl-delay directive in robots.txt files.

    [snip]
    That is interesting. I was always led to believe that you can control the crawl rate of the Googlebot for sites at root level, i.e. lounge.windowssecrets.com and windowssecrets.com, but not the rate for ones at a sublevel, i.e. windowssecrets.com/myfolder.

    Is there something I have missed, or is the WS site set up at multiple levels while DNS displays them as root level?
    Jerry

  10. #10
    Brian Livingston
    Guest
    Most IP.Board pages are served via a file called index.php, which is at the root level. How the pages are actually stored on the server involves a long discussion.

    In light of the SEOipb.com site that jscher2000 recommended above, the Seattle developers will also add to the Lounge's robots.txt file many, but not all, of the lines that are shown on the following Web page (we've relocated some asterisks in some lines that we believe do not follow the robots standard):

    http://www.seoipb.com/robots-file

    These lines disallow Google and other robots from indexing various areas of the Lounge that are not useful content or are duplicate URLs that lead to the same page. (Duplicate pages can cause a site to be downgraded or banned by Google. For example, sites should disallow the "printer-friendly" versions of pages, which contain similar content to the normal pages.)

    You can always see the current status of our file by visiting Lounge.WindowsSecrets.com plus /robots.txt (lowercase). Thanks.
