PHP file_get_contents giving garbled output

Are you getting garbled output when fetching an external URL with file_get_contents (website scraping using PHP) and wondering what went wrong? The content shows weird characters instead of normal HTML because the website you are fetching sends its response GZIP-encoded.

<?php echo file_get_contents("http://www.example.com"); ?>

You can fix this issue in multiple ways.

Method 1 : Accept-Encoding

In this method we set the request headers so that the web server is asked to respond with a plain (uncompressed) response. This is done by setting the q-value of gzip and compress to 0 in the Accept-Encoding header. The header is attached through PHP's stream context feature; the following code shows how to set it when calling file_get_contents.

// Create a stream context
$opts = array(
    'http' => array(
        'method' => "GET",
        // Sets the Accept-Encoding header
        'header' => "Accept-Encoding: gzip;q=0, compress;q=0\r\n",
    ),
);
$context = stream_context_create($opts);

// Open the URL using the HTTP headers set above
echo file_get_contents("http://www.example.com", false, $context);

This should make the server send a plain response to your client, provided it honors the Accept-Encoding header.
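To check whether the server actually honored the header, one option is to look at the response headers. The following is a minimal sketch, assuming the header lines have the shape PHP exposes in $http_response_header after an HTTP file_get_contents call. The helper name response_is_gzipped is hypothetical, and the sample headers are hand-written for illustration.

```php
<?php
// Hypothetical helper: returns true if any header line declares gzip encoding.
// $headers is expected to look like PHP's $http_response_header array.
function response_is_gzipped(array $headers)
{
    foreach ($headers as $line) {
        // Header names are case-insensitive, so match with the /i flag.
        if (preg_match('/^Content-Encoding:\s*(x-)?gzip\b/i', $line)) {
            return true;
        }
    }
    return false;
}

// Example with hand-written header lines (no network access needed).
$sample = array(
    "HTTP/1.1 200 OK",
    "Content-Type: text/html",
    "Content-Encoding: gzip",
);
var_dump(response_is_gzipped($sample)); // bool(true)
```

If this reports true even with the Accept-Encoding header set, the server is ignoring the request and the body has to be decompressed on your side.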

Method 2 : Unzipping the contents locally

This method is slower, but useful if the server forces gzip output. We can use the gzdecode function, which is built in since PHP 5.4.0, or on older versions the gzdecode definition given on the gzdecode manual page on PHP.net. The code to decompress the response is given below:

if (!function_exists("gzdecode")) {
    function gzdecode($data) {
        $len = strlen($data);
        if ($len < 18 || strcmp(substr($data, 0, 2), "\x1f\x8b")) {
            return null; // Not GZIP format (see RFC 1952)
        }
        $method = ord(substr($data, 2, 1)); // Compression method
        $flags  = ord(substr($data, 3, 1)); // Flags
        if (($flags & 31) != $flags) {
            // Reserved bits are set -- NOT ALLOWED by RFC 1952
            return null;
        }
        // NOTE: $mtime may be negative (PHP integer limitations)
        $mtime = unpack("V", substr($data, 4, 4));
        $mtime = $mtime[1];
        $xfl = substr($data, 8, 1);
        $os  = substr($data, 9, 1);
        $headerlen = 10;
        $extralen = 0;
        $extra = "";
        if ($flags & 4) {
            // 2-byte length prefixed EXTRA data in header
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            $extralen = unpack("v", substr($data, $headerlen, 2));
            $extralen = $extralen[1];
            if ($len - $headerlen - 2 - $extralen < 8) {
                return false; // Invalid format
            }
            $extra = substr($data, $headerlen + 2, $extralen);
            $headerlen += 2 + $extralen;
        }
        $filenamelen = 0;
        $filename = "";
        if ($flags & 8) {
            // C-style string file NAME data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            $filenamelen = strpos(substr($data, $headerlen), chr(0));
            if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
                return false; // Invalid format
            }
            $filename = substr($data, $headerlen, $filenamelen);
            $headerlen += $filenamelen + 1;
        }
        $commentlen = 0;
        $comment = "";
        if ($flags & 16) {
            // C-style string COMMENT data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            $commentlen = strpos(substr($data, $headerlen), chr(0));
            if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
                return false; // Invalid header format
            }
            $comment = substr($data, $headerlen, $commentlen);
            $headerlen += $commentlen + 1;
        }
        $headercrc = "";
        if ($flags & 1) {
            // 2 bytes (lowest order) of CRC32 on header present
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            $calccrc = crc32(substr($data, 0, $headerlen)) & 0xffff;
            $headercrc = unpack("v", substr($data, $headerlen, 2));
            $headercrc = $headercrc[1];
            if ($headercrc != $calccrc) {
                return false; // Bad header CRC
            }
            $headerlen += 2;
        }
        // GZIP FOOTER - these may be negative due to PHP's integer limitations
        $datacrc = unpack("V", substr($data, -8, 4));
        $datacrc = $datacrc[1];
        $isize = unpack("V", substr($data, -4));
        $isize = $isize[1];
        // Perform the decompression:
        $bodylen = $len - $headerlen - 8;
        if ($bodylen < 1) {
            // This should never happen - IMPLEMENTATION BUG!
            return null;
        }
        $body = substr($data, $headerlen, $bodylen);
        switch ($method) {
            case 8:
                // Currently the only supported compression method:
                $data = gzinflate($body);
                break;
            default:
                // Unknown compression method
                return false;
        }
        // Verify decompressed size and CRC32:
        // NOTE: This may fail with large data sizes depending on how
        // PHP's integer limitations affect strlen() since $isize
        // may be negative for large sizes.
        if ($isize != strlen($data) || crc32($data) != $datacrc) {
            // Bad format! Length or CRC doesn't match!
            return false;
        }
        return $data;
    }
}

// Call gzdecode only after the fallback above has been defined
echo gzdecode(file_get_contents("http://www.example.com"));
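Calling gzdecode on a response that is already plain will fail, so it can help to decode only when the body really is gzip data. Here is a minimal sketch, assuming the zlib extension is loaded. The helper name maybe_gunzip is hypothetical, and the check relies on the two magic bytes that start every gzip stream (RFC 1952).

```php
<?php
// Hypothetical helper: decompress $body only if it starts with the gzip magic bytes.
function maybe_gunzip($body)
{
    // Every gzip stream begins with the two bytes 0x1f 0x8b (RFC 1952).
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // Does not look like gzip, return unchanged.
    }
    // @ suppresses the warning gzdecode() raises on corrupt input.
    $decoded = @gzdecode($body);
    // Fall back to the raw body if decompression failed
    // (false from the built-in, null or false from a userland fallback).
    return ($decoded === false || $decoded === null) ? $body : $decoded;
}

// Round-trip check with locally compressed data (no network access needed).
$html = "<html><body>Hello</body></html>";
var_dump(maybe_gunzip(gzencode($html)) === $html); // compressed input is decoded
var_dump(maybe_gunzip($html) === $html);           // plain input passes through
```

In a scraper this would wrap the fetch, for example maybe_gunzip(file_get_contents("http://www.example.com")), so the same code path handles both compressed and plain responses.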

Happy Scraping :)

08 Mar, 2010
Comments (1)
  • Much easier way

    Below is a way easier way to decode if gzdecode is not defined in your PHP install

    function gzdecode($data){
    $g=tempnam(sys_get_temp_dir(),'ff');
    @file_put_contents($g,$data);
    ob_start();
    readgzfile($g);
    $d=ob_get_clean();
    unlink($g); // remove the temporary file
    return $d;
    }

    By rynop on 15 Jun, 2011