Search Engine Optimization starts with visibility. If search engine crawlers cannot discover, crawl, and render your pages, your site will not rank, regardless of the quality of your articles. Establishing a custom sitemap and robots.txt is the first step to technical indexing.
Demystifying Crawl Budget
Search engines do not crawl every page on the web continuously. Google assigns each website a **crawl budget**: the set of URLs Googlebot can and wants to crawl on your site in a given timeframe. Two factors shape it: the crawl capacity limit (how many requests your server can absorb, which depends on response times and error rates) and crawl demand (how popular your pages are and how often they change).
If your website has broken links, bloated assets, or duplicate pages, Googlebot may spend its budget on these low-value URLs and reach your high-quality blog articles less often. A well-configured robots.txt and sitemap help steer crawlers toward the pages that matter.
Crafting a Strategic Robots.txt File
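The robots.txt file must be served from the root of your host (for example, https://umakantdev.com/robots.txt), which usually means placing it in the root directory of your public web folder. It is a set of instructions for automated web crawlers. Here is how to structure it:
# Rules below apply to all user-agents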
User-agent: *
Disallow: /app/
Disallow: /writable/
Disallow: /vendor/
Disallow: /admin/login
Disallow: /*?* # Block URLs with query parameters to prevent duplicate crawl issues
# Declare the location of the main sitemap index
Sitemap: https://umakantdev.com/sitemap.xml
Crucial Robots.txt Directives
- User-agent: Target specific crawlers (e.g. `Googlebot`, `Bingbot`, or the wildcard `*` for all).
- Disallow: Tell the crawler not to access specific directories or URLs. Avoid disallowing CSS or JS files, as Googlebot needs them to render pages properly.
- Sitemap: Provide the absolute URL of your XML sitemap. Crawlers read this line when they fetch robots.txt, so they can find the sitemap without a separate submission.
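To see which URLs a set of Disallow rules would actually block, it can help to model the matching locally. The following is a minimal sketch, assuming Google-style matching where `*` stands for any run of characters and a trailing `$` anchors the end of the URL. The helper names `robotsPatternMatches()` and `isDisallowed()` are hypothetical, and the sketch ignores Allow rules and longest-match precedence.

```php
<?php
// Sketch only: models Disallow matching with "*" wildcards and a trailing "$".
// robotsPatternMatches() and isDisallowed() are hypothetical helper names.

function robotsPatternMatches(string $pattern, string $path): bool
{
    // A trailing "$" means the pattern must match through the end of the URL.
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }

    // Escape regex metacharacters, then turn each escaped "*" into ".*".
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));

    // Rules are prefix matches, so only the start is anchored by default.
    return preg_match('#^' . $regex . ($anchored ? '$' : '') . '#', $path) === 1;
}

function isDisallowed(array $patterns, string $path): bool
{
    foreach ($patterns as $pattern) {
        if (robotsPatternMatches($pattern, $path)) {
            return true;
        }
    }
    return false;
}

// The Disallow rules from the robots.txt example in this article.
$disallow = ['/app/', '/writable/', '/vendor/', '/admin/login', '/*?*'];

var_dump(isDisallowed($disallow, '/blog/my-post'));  // bool(false)
var_dump(isDisallowed($disallow, '/blog?page=2'));   // bool(true)
```

Note that the query-string rule also blocks paginated URLs such as `/blog?page=2`, so it is worth checking your own URL structure before adopting it.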
Dynamic XML Sitemap Construction
A sitemap is an XML document listing the URLs of a website, along with optional metadata about each URL (last modified date, change frequency, priority). Google ignores the change frequency and priority values and uses the last modified date only when it is consistently accurate, so treat that date as the field worth getting right. Hardcoding static XML sitemaps is inefficient for active websites, as new articles are published regularly.
Your server-side backend can instead generate the sitemap on each request by emitting the XML header and looping over your URLs. Here is an example of a sitemap generator written as a CodeIgniter 4 controller:
<?php

namespace App\Controllers;

use CodeIgniter\Controller;

class Sitemap extends Controller
{
    public function index()
    {
        // Static paths; append article URLs loaded from your database here
        $urls = [
            '',
            'about',
            'services/web-development',
            'blog',
        ];

        // Build the XML response line by line
        $xml   = [];
        $xml[] = '<?xml version="1.0" encoding="UTF-8"?>';
        $xml[] = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

        foreach ($urls as $url) {
            $loc   = base_url($url);
            $xml[] = '  <url>';
            // ENT_XML1 escapes &, <, > and quotes as XML entities
            $xml[] = '    <loc>' . htmlspecialchars($loc, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</loc>';
            // Placeholder: today's date. Prefer each page's real last-modified date,
            // because a lastmod that changes on every request tells crawlers nothing.
            $xml[] = '    <lastmod>' . date('Y-m-d') . '</lastmod>';
            $xml[] = '    <changefreq>weekly</changefreq>';
            $xml[] = '    <priority>0.8</priority>';
            $xml[] = '  </url>';
        }

        $xml[] = '</urlset>';

        return $this->response
            ->setHeader('Content-Type', 'text/xml')
            ->setBody(implode("\n", $xml));
    }
}
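For article URLs, the last modified date is better taken from the article record than from the current date. The following framework-independent sketch shows one way to do that. The function name `buildSitemapXml()`, the `path` and `updated_at` keys, the sample rows, and the `https://example.com` base URL are all illustrative assumptions, not part of CodeIgniter or of the controller in this article.

```php
<?php
// Sketch only: buildSitemapXml() and the row shape are hypothetical.

function buildSitemapXml(string $baseUrl, array $pages): string
{
    $xml   = [];
    $xml[] = '<?xml version="1.0" encoding="UTF-8"?>';
    $xml[] = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

    foreach ($pages as $page) {
        // Join base URL and path with exactly one slash between them.
        $loc = rtrim($baseUrl, '/') . '/' . ltrim($page['path'], '/');

        $xml[] = '  <url>';
        $xml[] = '    <loc>' . htmlspecialchars($loc, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</loc>';

        // Emit lastmod only when the record carries a real timestamp.
        if (!empty($page['updated_at'])) {
            $xml[] = '    <lastmod>' . date('Y-m-d', strtotime($page['updated_at'])) . '</lastmod>';
        }

        $xml[] = '  </url>';
    }

    $xml[] = '</urlset>';

    return implode("\n", $xml);
}

// Illustrative rows, standing in for the result of a database query.
$pages = [
    ['path' => '', 'updated_at' => '2024-01-15 09:30:00'],
    ['path' => 'blog/my-first-post', 'updated_at' => '2024-03-02 18:05:00'],
    ['path' => 'about', 'updated_at' => null],
];

echo buildSitemapXml('https://example.com', $pages);
```

Inside a CodeIgniter 4 controller, the returned string could be passed to `setBody()` as in the example above. A route along the lines of `$routes->get('sitemap.xml', 'Sitemap::index');` in app/Config/Routes.php would typically expose the controller at /sitemap.xml.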
Verification and Submission to Search Console
Once you implement these configurations, verify them using Google Search Console:
- Navigate to the **Sitemaps** report in Google Search Console.
- Input your sitemap URL (e.g. `sitemap.xml`) and click **Submit**.
- Monitor status updates. Google will display "Success" if the sitemap loads and parses correctly.
- Check your robots.txt file with the robots.txt report in Search Console (which replaced the older standalone robots.txt Tester), and use the URL Inspection tool to confirm that your disallow rules do not block important content pages.
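Before submitting, a quick local check can catch the most common sitemap problems. The sketch below assumes PHP's SimpleXML extension is available. The function name `checkSitemap()` is hypothetical, the sample XML is illustrative, and the 50,000 figure is the per-file URL limit from the sitemaps.org protocol.

```php
<?php
// Sketch only: checkSitemap() is a hypothetical helper, not a library function.

function checkSitemap(string $xml): array
{
    $problems = [];

    // Collect libxml parse errors quietly instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc      = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc === false) {
        return ['XML does not parse'];
    }
    if ($doc->getName() !== 'urlset') {
        $problems[] = 'Root element is not <urlset>';
    }

    $count = 0;
    foreach ($doc->url as $url) {
        $count++;
        $loc = trim((string) $url->loc);
        // The protocol requires absolute URLs in <loc>.
        if (filter_var($loc, FILTER_VALIDATE_URL) === false) {
            $problems[] = "Entry {$count} has a missing or relative <loc>";
        }
    }

    if ($count === 0) {
        $problems[] = 'No <url> entries found';
    }
    if ($count > 50000) {
        $problems[] = 'More than 50,000 URLs; split into multiple sitemaps';
    }

    return $problems;
}

// Illustrative sample with one valid and one relative URL.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>/about</loc></url>
</urlset>
XML;

print_r(checkSitemap($sample));
```

With the sample above, the check flags the second entry because its `<loc>` is relative. An empty array means none of these basic problems were found, which is not the same as Google accepting the sitemap.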
Conclusion
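Setting up robots.txt and an automated sitemap helps search engine crawlers discover your articles sooner and spend their crawl budget on the pages that matter. Neither file guarantees indexing, but together they remove the most common technical obstacles to search traffic and give your site a sounder footing when you apply for Google AdSense.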