It's been a while since our last blog entry, mainly because we have been studying SEO and how to take full advantage of robots.txt to get better results.
It came to our attention because, even though we had all the SEF options turned on, Google was still indexing non-SEF URLs.
We had 758 pages indexed. Now every day we are seeing pages vanish (that's our goal).
First off, if you didn't know: install Xmap. It generates a sitemap not only for your site's visitors but also an XML version, built on the fly, that can be submitted to Google, Yahoo, Ask, MSN, and several other sites.
You can also add this to your robots.txt:
Sitemap: http://www.joomlamafia.com/index.php?option=com_xmap&sitemap=1&view=xml&no_html=1
What does this do? Every time a bot comes to your site, it looks at robots.txt to learn what to index and what not to index. The first thing it now finds out, and this applies to all bots, is: here's my sitemap, here are all my pages! None of the links we block below should appear in it (and we know they don't, because this is the same sitemap our visitors see when they click the sitemap link).
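To make the idea concrete, here is a minimal Python sketch of how a crawler might pick the Sitemap line out of a robots.txt file. The function name sitemap_urls and the SAMPLE text are our own illustration, not code from any search engine, and real bots will certainly do more than this.

```python
def sitemap_urls(robots_txt):
    """Return every URL declared on a 'Sitemap:' line of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the "http://" in the URL survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt body using the Sitemap line from this post.
SAMPLE = """\
User-agent: *
Disallow: /component/*
Sitemap: http://www.joomlamafia.com/index.php?option=com_xmap&sitemap=1&view=xml&no_html=1
"""

for url in sitemap_urls(SAMPLE):
    print(url)
```

The field name is compared case-insensitively, so "sitemap:" and "Sitemap:" are treated the same way.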
Next thing we added was
Disallow: /index2.php
and then
Disallow: /component/content/article/*
Disallow: /component/search/*
Disallow: /component/*
This keeps any component pages from getting indexed, as well as any search pages, and the first rule keeps articles in non-SEF format from being indexed and causing duplicate content.
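If you want to check which URLs these rules would catch before you upload the file, a rough Python sketch like the one below can help. It assumes the wildcard behaviour Google documents (a rule matches from the start of the path, * stands for any run of characters, and a trailing $ pins the end of the URL). The helper names and the test paths are made up for illustration, and other bots may read the rules differently.

```python
import re

# The Disallow rules from this post (with the leading slash robots.txt paths need).
DISALLOW_RULES = [
    "/index2.php",
    "/component/content/article/*",
    "/component/search/*",
    "/component/*",
]


def rule_to_regex(rule):
    """Translate one Disallow pattern into a compiled regular expression."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape every literal piece, then join the pieces with "any characters".
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(path, rules=DISALLOW_RULES):
    """Return True if any rule matches from the start of the URL path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# Hypothetical paths, only to show the behaviour.
for path in [
    "/component/search/?searchword=seo",
    "/index2.php?option=com_content",
    "/blog/robots-txt-tips",
]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Under these matching rules, /component/* already covers the two more specific /component/ lines, so listing all three is harmless but not strictly required.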
How Can I Apply This To My Site?
Regularly check what kind of pages Google is indexing on your site and look for patterns. If there are a lot of PDF pages, or dozens of useless links from a particular component, you can act quickly to block them out with robots.txt. Use the site:mydomain.com search function or a tool such as WebCEO.com.
Among the most important things you can do is check your pages that are in Google's supplemental index. This is where you'll find lots of your low-quality pages, ripe for removal by robots.txt. If the pages don't contain useful information, dump them.
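One way to look for those patterns is to paste the indexed URLs into a list and count them by the first part of the path. The Python sketch below does that. The example.com URLs are invented sample data, and group_by_section is just our name for the helper, so treat it as a starting point rather than a finished tool.

```python
from collections import Counter
from urllib.parse import urlsplit


def group_by_section(urls):
    """Count URLs by the first segment of their path (e.g. /component)."""
    counts = Counter()
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        section = ("/" + segments[0]) if segments else "/"
        counts[section] += 1
    return counts


# Hypothetical list of indexed URLs, e.g. copied from a site: search.
indexed = [
    "http://www.example.com/component/search/?searchword=seo",
    "http://www.example.com/component/content/article/12",
    "http://www.example.com/component/mailto/?tmpl=component",
    "http://www.example.com/blog/robots-txt",
    "http://www.example.com/index2.php?option=com_content&id=5",
]

# Sections with the most indexed pages come first.
for section, count in group_by_section(indexed).most_common():
    print(section, count)
```

A section that shows up far more often than you expect, such as /component here, is a good candidate for a Disallow rule.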
Originally the wildcard wasn't supported by robots.txt but that has since changed. Both Google and Yahoo now recognize it.