MyBB Community Forums

Can someone help me with a parser
When it comes to parsing HTML, I really know nothing about it.

I want to fetch info from anidb.net.

			<tr class="official verified no"> 
				<th class="field">Official Title</th> 
				<td class="value"> 
				<span class="icons"> 
<span class="i_icon i_audio_en" title=" language: english"><span>en</span></span> 
				</span> 
				<label>Red Riding Hood Chacha</label></td> 
			</tr> 
			<tr class="g_odd official verified no"> 
				<th class="field">Official Title</th> 
				<td class="value"> 
				<span class="icons"> 
<span class="i_icon i_audio_ja" title=" language: japanese"><span>ja</span></span> 
				</span> 
				<label>赤ずきんチャチャ</label></td> 
			</tr> 
			<tr class="type"> 
				<th class="field">Type</th> 
				<td class="value">TV Series, 74 episodes</td> 
			</tr> 
			<tr class="g_odd year"> 
				<th class="field">Year</th> 
				<td class="value">07.01.1994 till 30.06.1995</td> 
			</tr>

This is a sample of what I want to parse (there is more that I have not posted).

I want to parse the value of each td (the contents between <td class="value">{content}</td>) within a specific tr row.

But I have no idea how to do it, or how the script will know that I want the year, type, etc.

Any code that will help me?
From what I understand, you want to grab specific values from elements on the page. Correct?

If so, preg_match would work perfectly. I'm not so hot with regular expressions myself though, so somebody else may be able to help you out more on that front.
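To make the preg_match suggestion concrete, here is a minimal sketch that looks up one row by its label, which is one way the script can "know" it is reading the year or the type. The helper name field_value() is hypothetical, and the sample rows are the ones from the question; the real page may need a looser pattern.

```php
<?php
// Hypothetical helper (not from the thread): returns the text of the
// <td class="value"> cell that follows a <th class="field"> with the given label.
function field_value($html, $label)
{
    $pattern = '#<th class="field">\s*' . preg_quote($label, '#')
             . '\s*</th>\s*<td class="value">(.*?)</td>#si';
    if (preg_match($pattern, $html, $m)) {
        // Drop nested tags (icons, labels) and surrounding whitespace.
        return trim(strip_tags($m[1]));
    }
    return null; // label not found
}

// Sample rows copied from the question.
$html = '<tr class="type">
    <th class="field">Type</th>
    <td class="value">TV Series, 74 episodes</td>
</tr>
<tr class="g_odd year">
    <th class="field">Year</th>
    <td class="value">07.01.1994 till 30.06.1995</td>
</tr>';

echo field_value($html, 'Year'), "\n"; // 07.01.1994 till 30.06.1995
echo field_value($html, 'Type'), "\n"; // TV Series, 74 episodes
```

The label goes through preg_quote() so that any regex metacharacters in it are matched literally.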
Try this:
<?php
if(!function_exists("gzdecode"))
{
	function gzdecode($data,&$filename='',&$error='',$maxlength=null) 
	{
		$len = strlen($data);
		if ($len < 18 || strcmp(substr($data,0,2),"\x1f\x8b")) {
			$error = "Not in GZIP format.";
			return null;  // Not GZIP format (See RFC 1952)
		}
		$method = ord(substr($data,2,1));  // Compression method
		$flags  = ord(substr($data,3,1));  // Flags
		if (($flags & 31) != $flags) {  // parentheses needed: != binds tighter than &
			$error = "Reserved bits not allowed.";
			return null;
		}
		// NOTE: $mtime may be negative (PHP integer limitations)
		$mtime = unpack("V", substr($data,4,4));
		$mtime = $mtime[1];
		$xfl   = substr($data,8,1);
		$os    = substr($data,9,1);  // OS byte follows XFL (RFC 1952)
		$headerlen = 10;
		$extralen  = 0;
		$extra     = "";
		if ($flags & 4) {
			// 2-byte length prefixed EXTRA data in header
			if ($len - $headerlen - 2 < 8) {
				return false;  // invalid
			}
			$extralen = unpack("v",substr($data,$headerlen,2));  // XLEN follows the 10-byte fixed header
			$extralen = $extralen[1];
			if ($len - $headerlen - 2 - $extralen < 8) {
				return false;  // invalid
			}
			$extra = substr($data,$headerlen + 2,$extralen);  // EXTRA data starts after XLEN
			$headerlen += 2 + $extralen;
		}
		$filenamelen = 0;
		$filename = "";
		if ($flags & 8) {
			// C-style string
			if ($len - $headerlen - 1 < 8) {
				return false; // invalid
			}
			$filenamelen = strpos(substr($data,$headerlen),chr(0));
			if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
				return false; // invalid
			}
			$filename = substr($data,$headerlen,$filenamelen);
			$headerlen += $filenamelen + 1;
		}
		$commentlen = 0;
		$comment = "";
		if ($flags & 16) {
			// C-style string COMMENT data in header
			if ($len - $headerlen - 1 < 8) {
				return false;    // invalid
			}
			$commentlen = strpos(substr($data,$headerlen),chr(0));
			if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
				return false;    // Invalid header format
			}
			$comment = substr($data,$headerlen,$commentlen);
			$headerlen += $commentlen + 1;
		}
		$headercrc = "";
		if ($flags & 2) {
			// 2-bytes (lowest order) of CRC32 on header present
			if ($len - $headerlen - 2 < 8) {
				return false;    // invalid
			}
			$calccrc = crc32(substr($data,0,$headerlen)) & 0xffff;
			$headercrc = unpack("v", substr($data,$headerlen,2));
			$headercrc = $headercrc[1];
			if ($headercrc != $calccrc) {
				$error = "Header checksum failed.";
				return false;    // Bad header CRC
			}
			$headerlen += 2;
		}
		// GZIP FOOTER
		$datacrc = unpack("V",substr($data,-8,4));
		$datacrc = sprintf('%u',$datacrc[1] & 0xFFFFFFFF);
		$isize = unpack("V",substr($data,-4));
		$isize = $isize[1];
		// decompression:
		$bodylen = $len-$headerlen-8;
		if ($bodylen < 1) {
			// IMPLEMENTATION BUG!
			return null;
		}
		$body = substr($data,$headerlen,$bodylen);
		$data = "";
		if ($bodylen > 0) {
			switch ($method) {
			case 8:
				// Currently the only supported compression method:
				$data = gzinflate($body,$maxlength);
				break;
			default:
				$error = "Unknown compression method.";
				return false;
			}
		}  // zero-byte body content is allowed
		// Verify CRC32 and uncompressed length
		$crc   = sprintf("%u",crc32($data));
		$crcOK = $crc == $datacrc;
		$lenOK = $isize == strlen($data);
		if (!$lenOK || !$crcOK) {
			$error = ( $lenOK ? '' : 'Length check FAILED. ') . ( $crcOK ? '' : 'Checksum FAILED.');
			return false;
		}
		return $data;
	}
}


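// Fetch the page; the server sends the body gzip-compressed.
$contents = file_get_contents("http://anidb.net/perl-bin/animedb.pl?show=anime&aid=854");
$contents = gzdecode($contents);
// Capture each <th class="field"> label and the <td class="value"> cell that follows it.
if(preg_match_all("#\<th\ class\=\"field\"\>(.*?)\<\/th\>.*?\<td\ class\=\"value\"\>(.*?)\<\/td\>#si", $contents, $match))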
{
	$count = count($match[0]);
	$info = array();
	for($i=0;$i<$count;$i++)
	{
		$field = trim(strip_tags($match[1][$i]));
		$value = trim(strip_tags($match[2][$i]));
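		// Rows that share a label (e.g. the two "Official Title" rows) overwrite each other here;
		// use $info[$field][] = $value; instead to keep every value.
		$info[$field] = $value;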
	}
	
	var_dump($info);
}

The web server forces gzipped data, so we need to decode it. gzdecode() is not available in PHP 5 before version 5.4, so it has to be defined when it is missing. The function is one that I copied from the PHP comments and it seems to work fine.
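In case the server ever returns the page uncompressed, a small guard can check for the gzip magic bytes before decoding. This is only a sketch: maybe_gunzip() is a hypothetical name, and it assumes gzdecode() exists, either built in (PHP 5.4 and later) or defined as above.

```php
<?php
// Hypothetical helper: decode only when the data starts with the gzip
// magic bytes 0x1f 0x8b (RFC 1952); otherwise return it unchanged.
function maybe_gunzip($data)
{
    if (strlen($data) >= 18 && substr($data, 0, 2) === "\x1f\x8b") {
        $decoded = gzdecode($data);
        // The built-in returns false on failure and the function above may
        // also return null; fall back to the raw data in both cases.
        return ($decoded === false || $decoded === null) ? $data : $decoded;
    }
    return $data;
}

$plain = '<td class="value">TV Series, 74 episodes</td>';
$gz    = gzencode($plain); // produce a gzip-format string to test with

var_dump(maybe_gunzip($gz) === $plain);    // bool(true)
var_dump(maybe_gunzip($plain) === $plain); // bool(true)
```

In the script above, that would mean calling maybe_gunzip($contents) where gzdecode($contents) is called now.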
Thank you a lot. Works like a charm. If I ever come near where you live, all the beer's on me :D