robots.txt Generator

How the Robots.txt Generator Works

This robots.txt generator produces a syntactically valid robots.txt file for your website. Select user agents, define Disallow and Allow rules, set a Crawl-delay, and add your sitemap URLs; the tool then outputs a finished file ready to place at the root of your domain. A built-in validator catches common syntax errors that could cause crawlers to misinterpret or ignore your directives.
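
As a rough illustration of what such a generator does, the sketch below renders rule groups and sitemap URLs into robots.txt text. The `Group` type and `render_robots_txt` function are hypothetical names invented for this example; they are not the tool's actual implementation, and no validation is performed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Group:
    """One robots.txt group: a user agent plus its rules (hypothetical type)."""
    user_agent: str
    disallow: list[str] = field(default_factory=list)
    allow: list[str] = field(default_factory=list)
    crawl_delay: int | None = None


def render_robots_txt(groups: list[Group], sitemaps: list[str]) -> str:
    """Render groups and sitemap URLs as robots.txt text."""
    lines: list[str] = []
    for group in groups:
        lines.append(f"User-agent: {group.user_agent}")
        lines.extend(f"Disallow: {path}" for path in group.disallow)
        lines.extend(f"Allow: {path}" for path in group.allow)
        if group.crawl_delay is not None:
            lines.append(f"Crawl-delay: {group.crawl_delay}")
        lines.append("")  # a blank line separates groups
    lines.extend(f"Sitemap: {url}" for url in sitemaps)
    # Normalise to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


# Example values only; the paths and URL are assumptions for illustration.
robots_txt = render_robots_txt(
    [Group("*", disallow=["/admin/"], allow=["/admin/public/"], crawl_delay=10)],
    ["https://example.com/sitemap.xml"],
)
print(robots_txt)
```

The resulting text would be saved as robots.txt at the root of the domain.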

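Each group in robots.txt begins with a User-agent: line that names the crawler the rules apply to. User-agent: * applies the rules to every robot. Specific agents such as Googlebot, Bingbot, or GPTBot can have their own group, which takes precedence over the wildcard group for that crawler. Under Google's interpretation, the order of rules within a group does not matter: the most specific (longest) matching path wins, and when an Allow and a Disallow rule match equally, Allow wins. Some other parsers apply the first matching rule instead, so listing the more specific rule first is the safer choice.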
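
The rule precedence that Google documents (the longest matching path wins, and Allow wins a tie) can be sketched as follows. This is a simplified illustration, not Google's parser: `is_allowed` is a hypothetical helper, it handles plain path prefixes only, and it ignores wildcards.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether `path` may be crawled under longest-match precedence.

    `rules` holds (directive, prefix) pairs such as ("disallow", "/admin/").
    Wildcards are deliberately not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Example rules; the paths are assumptions chosen for illustration.
rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed("/admin/settings", rules))
print(is_allowed("/admin/public/logo.png", rules))
```

Because the decision depends only on match length, swapping the two rules in the list gives the same result.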

Many AI companies run crawlers that collect content for model training. Bots that publishers commonly block include GPTBot (OpenAI), CCBot (Common Crawl), Google-Extended (Google Gemini training), anthropic-ai (Anthropic Claude), and PerplexityBot. This tool offers a one-click option that adds a Disallow: / rule for every AI training crawler it knows about, a popular choice for publishers who want control over whether their content ends up in AI training datasets.
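
A one-click block of this kind amounts to emitting one group per user agent. The sketch below uses only the user agents named above; `AI_USER_AGENTS` and `ai_block_rules` are hypothetical names, and the list the tool actually ships may differ.

```python
# User agents taken from the paragraph above; the tool's real list may differ.
AI_USER_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai", "PerplexityBot"]


def ai_block_rules(user_agents: list[str] = AI_USER_AGENTS) -> str:
    """Return robots.txt text that disallows the whole site for each agent."""
    groups = [f"User-agent: {ua}\nDisallow: /" for ua in user_agents]
    # Groups are separated by a blank line.
    return "\n\n".join(groups) + "\n"


print(ai_block_rules())
```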

Disallow: /admin/ keeps search engines from crawling your CMS backend. Disallow: /checkout/ keeps cart and checkout pages under that path from being crawled. Disallow: /search? blocks internal search result pages, which generate low-value duplicate content. Disallow: /*.pdf$ excludes PDF files. Note that Disallow stops crawling, not indexing: a blocked URL can still appear in search results if other pages link to it. robots.txt is also only advisory: it restrains crawlers that follow the rules but provides no security, so pages with sensitive data still need access controls.
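
To see how the `*` wildcard and the `$` end anchor in these patterns behave, the sketch below translates a robots.txt path pattern into a regular expression. `pattern_to_regex` and `matches` are hypothetical helpers written for this illustration; real crawlers may differ in details such as percent-encoding.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any run of characters; a trailing "$" anchors the end of the URL.
    Everything else, including "?" and ".", is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every character, then turn the escaped "*" back into ".*".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """True if `path` (path plus query string) matches from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


# Example paths are assumptions chosen for illustration.
print(matches("/*.pdf$", "/files/report.pdf"))
print(matches("/*.pdf$", "/files/report.pdf?download=1"))
print(matches("/search?", "/search?q=shoes"))
```

Without the trailing `$`, a pattern acts as a prefix, which is why /admin/ covers everything beneath that directory.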

Crawl-delay: 10 asks crawlers to wait 10 seconds between requests, reducing server load from overly aggressive crawling. Google ignores Crawl-delay, and the crawl rate setting that Search Console once offered as an alternative has been retired; Bing and many other bots honor the directive. The Sitemap: directive (for example, Sitemap: https://example.com/sitemap.xml) can appear anywhere in robots.txt and tells every crawler where to find your sitemap, with no need to submit it manually to each search engine.
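
One way to check how a parser reads these two directives is Python's standard-library `urllib.robotparser`. The sketch below parses an in-memory file rather than fetching one; the file contents are an assumed example, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file; Sitemap sits outside any group and applies to all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse text directly, no network request

delay = parser.crawl_delay("Bingbot")  # falls back to the "*" group
sitemaps = parser.site_maps()          # list of sitemap URLs, or None if absent

print(delay)
print(sitemaps)
```

A crawler that honors the directive would sleep for `delay` seconds between requests to the site.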

Frequently Asked Questions

What is robots.txt?

robots.txt is a text file at the root of your website that tells search engine crawlers which pages or paths they are allowed or disallowed from crawling. It is a voluntary standard — well-behaved bots respect it, but malicious bots may ignore it.

Can I block AI training bots with robots.txt?

Yes, for crawlers that honor it. Major AI companies state that their crawlers respect robots.txt disallow rules. GPTBot (OpenAI), CCBot (Common Crawl, whose data is used for training), anthropic-ai (Anthropic), and Google-Extended (Google Gemini, formerly Bard) can each be blocked with Disallow: / under their user agent.

Does Disallow: / block all bots?

Disallow: / for a specific User-agent blocks that bot from crawling your entire site. To block all bots, set User-agent: * followed by Disallow: /. However, this also blocks Googlebot, which stops Google from crawling your content and effectively removes your pages from its search results.

What is Crawl-delay?

Crawl-delay tells the bot to wait a specified number of seconds between requests. This prevents aggressive crawlers from overloading your server. Note: Googlebot ignores Crawl-delay, and the Search Console crawl rate setting that once served as the alternative has been retired.

Should I add my sitemap to robots.txt?

Yes. Adding a Sitemap: directive at the bottom of robots.txt tells all crawlers (not just Google) where your sitemap lives. This is in addition to submitting it in Google Search Console.
