Google’s mission is to organize the world’s information and make it universally accessible and useful. Along the way, it encounters non-HTML files such as PDFs, spreadsheets, and presentations. Google’s algorithms don’t let different file types slow them down; they work hard to extract the relevant content and to index it appropriately for search results. But how do they actually index these file types, and, since these files often differ so much from standard HTML, what guidelines apply to them? What if a webmaster doesn’t want Google to index them?
Google first started indexing PDF files in 2001 and currently has hundreds of millions of PDF files indexed. Google has collected the most frequently asked questions about PDF indexing; here are the answers:
Q: Can Google index any type of PDF file?
A: Generally, Google can index textual content (written in any language) from PDF files that use various kinds of character encodings, provided they’re not password protected or encrypted. If the text is embedded as images, Google may process the images with OCR algorithms to extract the text. The general rule of thumb is that if you can copy and paste the text from a PDF document into a standard text document, Google should be able to index that text.
Q: How are links treated in PDF documents?
A: Generally, links in PDF files are treated similarly to links in HTML: they can pass PageRank and other indexing signals, and may be followed after the PDF file is crawled. It’s currently not possible to “nofollow” links within a PDF document.
Q: How can I prevent my PDF files from appearing in search results; or if they already do, how can I remove them?
A: The simplest way to prevent PDF documents from appearing in search results is to add an X-Robots-Tag: noindex header to the HTTP response used to serve the file. If the files are already indexed, they’ll drop out over time once the X-Robots-Tag header with the noindex directive is in place. For faster removals, you can use the URL removal tool in Google Webmaster Tools.
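As an illustration of the header described above, here is a minimal sketch using Python’s standard http.server module. The helper name extra_headers_for and the class name NoIndexPDFHandler are hypothetical, and the sketch assumes PDF files can be recognized by a .pdf extension in the request path; a production web server would normally set this header in its own configuration instead.

```python
from http.server import SimpleHTTPRequestHandler


def extra_headers_for(path: str) -> dict:
    """Return extra response headers for a request path (hypothetical helper)."""
    # Ignore any query string when checking the file extension.
    clean_path = path.split("?", 1)[0]
    if clean_path.lower().endswith(".pdf"):
        # Ask search engines not to index this response.
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPDFHandler(SimpleHTTPRequestHandler):
    """Static file handler that marks PDF responses as noindex (illustrative)."""

    def end_headers(self) -> None:
        # Append the extra headers just before the header block is finalized.
        for name, value in extra_headers_for(self.path).items():
            self.send_header(name, value)
        super().end_headers()
```

To try it locally, the handler can be passed to http.server.ThreadingHTTPServer, after which the response headers for a PDF URL can be inspected with any HTTP client.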
Q: Is it considered duplicate content if I have a copy of my pages in both HTML and PDF?
A: Whenever possible, it’s recommended to serve a single copy of your content. If this isn’t possible, indicate your preferred version by, for example, including the preferred URL in your Sitemap or by specifying the canonical version in the HTML or in the HTTP headers of the PDF resource.
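Because a PDF has no HTML head section, the canonical version for a PDF is typically declared with a Link response header carrying rel="canonical". The sketch below only builds that header; the function name canonical_link_header and the example.com URL are illustrative assumptions, not part of any Google API.

```python
def canonical_link_header(preferred_url: str) -> tuple:
    """Build a (name, value) pair for a rel="canonical" Link header.

    Hypothetical helper: it formats the header only, and does not check
    that preferred_url is a valid or reachable URL.
    """
    return ("Link", f'<{preferred_url}>; rel="canonical"')


# Example: the PDF response points at the HTML page as the preferred version.
name, value = canonical_link_header("https://www.example.com/whitepaper.html")
print(f"{name}: {value}")
```

The resulting pair can be added to the response that serves the PDF in whatever way your server or framework sets custom headers.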
Q: How can I influence the title shown in search results for my PDF document?
A: Google uses two main elements to determine the title shown: the title metadata within the file, and the anchor text of links pointing to the PDF file. To give Google’s algorithms a strong signal about the proper title to use, it’s recommended to update both.