Discussion Board Text Parser Advice/Help



yaz
12 May 2006, 12:31 PM
I just finished building a completely custom discussion board system. It's mostly done; the only part giving me problems is parsing the text that gets inserted into the database and then shown on the thread/post page.

Here is the process that it goes through:

1) You enter your text in the post page. It should allow you to enter some basic HTML, some BBCode, smilies, etc.
2) When you click submit, the text gets parsed: the HTML tags need to be checked for validity, the BBCode checked and converted to HTML tags, images converted to proper tags, etc.
3) When viewing the page, NO PROCESSING is done to the text. (This makes for faster loading times).
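
To make the three steps concrete, here is a minimal sketch of the flow, assuming PHP 5.3+ for the closures. The names prepare_post_for_storage(), render_post() and the body_html column are made up for illustration and are not part of my actual code; the two stages are simple stand-ins for the real HTML/BBCode parsers.

```php
<?php
/** Step 2: run once at submit time and return the HTML that gets stored. */
function prepare_post_for_storage($rawText, array $stages)
{
    $html = $rawText;
    foreach ($stages as $stage) {
        // Each stage takes a string and returns the transformed string.
        $html = call_user_func($stage, $html);
    }
    return $html;
}

/** Step 3: viewing returns the stored HTML as-is, with no parsing. */
function render_post(array $row)
{
    return $row['body_html']; // hypothetical column name
}

// Stand-in stages for the demo: escape HTML, then turn newlines into <br />.
$stages = array(
    function ($t) { return htmlspecialchars($t, ENT_QUOTES, 'UTF-8'); },
    function ($t) { return str_replace("\n", "<br />\n", $t); },
);

$row = array('body_html' => prepare_post_for_storage("1 < 2\nok", $stages));
echo render_post($row);
```

Keeping the stages in an ordered list makes it easier to see (and change) which transformation runs before which, since the order matters when one stage escapes text that a later stage needs to match.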

My problem is in step 2. I have code that does all of it, but it's hard to customize, mostly because I'm having trouble with the regular expressions. It would be great if you could point me to a script that does this better, or to a way to fix mine.

This is what I need:

-I need to be able to specify which HTML tags are allowed.
-Ideally, the BBCode/HTML it outputs should be valid XHTML.
-A way to convert HTML back into BBCode would be completely amazing.
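
As a rough idea of what I mean by the list above, here is a sketch for the simplest case only: parameterless tags driven by one lookup table that is used in both directions. The $simpleTags map and the function names bbcode_to_xhtml() and xhtml_to_bbcode() are hypothetical, and the sketch assumes the text has already been escaped and that each BBCode tag maps to exactly one XHTML element with no attributes.

```php
<?php
// Hypothetical whitelist: BBCode tag => XHTML element. Edit it to allow or drop tags.
$simpleTags = array('b' => 'strong', 'i' => 'em');

/** Converts [tag]...[/tag] pairs from the map into <element>...</element>. */
function bbcode_to_xhtml($text, array $map)
{
    foreach ($map as $bb => $el) {
        $tag = preg_quote($bb, '/');
        $pattern = '/\[' . $tag . '\](.*?)\[\/' . $tag . '\]/si';
        // Repeat until nothing matches so nested uses of the same tag are converted too.
        do {
            $text = preg_replace($pattern, "<$el>\$1</$el>", $text, -1, $count);
        } while ($count > 0);
    }
    return $text;
}

/** Reverse direction: only works because the generated elements carry no attributes. */
function xhtml_to_bbcode($html, array $map)
{
    foreach ($map as $bb => $el) {
        $html = str_ireplace(array("<$el>", "</$el>"), array("[$bb]", "[/$bb]"), $html);
    }
    return $html;
}

$html = bbcode_to_xhtml('[b]Bold[/b] and [i]it[/i]', $simpleTags);
echo $html, "\n";
echo xhtml_to_bbcode($html, $simpleTags), "\n";
```

Tags with parameters (such as url or list) would not round-trip this easily, which is the part I'm unsure how to handle.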

So, here is the code:



/**
 * Converts BBCode to HTML
 */
function parsebbcode($text) {
    // Tags that need custom handling. Each pattern is paired by position with
    // an entry in $replacearray; the /e modifier evaluates that entry as PHP code.
    $searcharray = array(
        "/(\[)(list)(=)(['\"]?)([^\"']*)(\\4])(.*)(\[\/list)(((=)(\\4)([^\"']*)(\\4]))|(\]))/esiU",
        "/(\[)(list)(])(.*)(\[\/list\])/esiU",
        "/(\[)(url)(=)(['\"]?)([^\"']*)(\\4])(.*)(\[\/url\])/esiU",
        "/(\[)(url)(])(.*)(\[\/url\])/esiU",
        "/(\[)(code)(])(\r\n)*(.*)(\[\/code\])/esiU",
        "/(\[)(php)(])(\r\n)*(.*)(\[\/php\])/esiU"
    );
    $replacearray = array(
        "createlists('\\7', '\\5')",
        "createlists('\\4')",
        "checkurl('\\5', '\\7')",
        "checkurl('\\4')",
        "stripbrsfromcode('\\5')",
        "phphighlite('\\5')"
    );

    // Generic patterns for the database-driven tags; %s is replaced by the tag name.
    // [tag=param]content[/tag]: the parameter is group 5, the content is group 7.
    $doubleRegex = "/(\[)(%s)(=)(['\"]?)([^\"']*)(\\4])(.*)(\[\/%s\])/siU";
    // [tag]content[/tag]: the content is group 4.
    $singleRegex = "/(\[)(%s)(])(.*)(\[\/%s\])/siU";

    $bbcodes_q = QUERY; // placeholder for the actual query, see below
    /*
    Grabs the BBCode that we allow from the database. These are two examples of rows returned by this query:

    Example 1:
    id = 1
    tag = b
    replacement = <strong>\4</strong>
    example = Bold
    twoparams (flag) = 0

    Example 2:
    id = 2
    tag = link
    replacement = <a href="\5">\7</a>
    example = google
    twoparams (flag) = 1

    */

    while ($r = mysql_fetch_array($bbcodes_q)) {
        if ($r['twoparams']) {
            $regex = sprintf($doubleRegex, $r['tag'], $r['tag']);
        } else {
            $regex = sprintf($singleRegex, $r['tag'], $r['tag']);
        }
        // Register each pattern three times so that nested tags are also replaced.
        for ($i = 0; $i < 3; $i++) {
            $searcharray[] = $regex;
            $replacearray[] = $r['replacement'];
        }
    }

    $text = str_replace("\\'", "'", $text);
    $text = preg_replace($searcharray, $replacearray, $text);
    return $text;
}

/**
 * Clean included HTML tags (called from parsehtml)
 * @param array $tag  [0] = full tag, [1] = tag name, [2] = attribute string
 * Original code from phpBB
 */
function clean_html($tag)
{
    if (empty($tag[0])) { return ''; }

    $allowed_html_tags = preg_split('/, */', 'b,i,u,em,strong,span,href,src,a,img,center,br,div,li,ol,ul');
    // Drop style attributes and event handlers (onclick, onload, ...)
    $disallowed_attributes = '/^(?:style|on)/i';

    // Check if this is an end tag
    preg_match('/<[^\w\/]*\/[\W]*(\w+)/', $tag[0], $matches);
    if (sizeof($matches)) {
        if (in_array(strtolower($matches[1]), $allowed_html_tags)) {
            return '</' . $matches[1] . '>';
        } else {
            return htmlspecialchars('</' . $matches[1] . '>');
        }
    }

    // Check if this is an allowed tag
    if (in_array(strtolower($tag[1]), $allowed_html_tags)) {
        $attributes = '';
        if (!empty($tag[2])) {
            // Collect name="value" / name='value' pairs from the attribute string
            preg_match_all('/[\W]*?(\w+)[\W]*?=[\W]*?(["\'])((?:(?!\2).)*)\2/', $tag[2], $test);
            for ($i = 0; $i < sizeof($test[0]); $i++) {
                if (preg_match($disallowed_attributes, $test[1][$i])) {
                    continue;
                }
                // Encode square brackets so attribute values cannot be picked up as BBCode later
                $attributes .= ' ' . $test[1][$i] . '=' . $test[2][$i] . str_replace(array('[', ']'), array('&#91;', '&#93;'), htmlspecialchars($test[3][$i])) . $test[2][$i];
            }
        }
        if (in_array(strtolower($tag[1]), $allowed_html_tags)) {
            return '<' . $tag[1] . $attributes . '>';
        } else {
            return htmlspecialchars('<' . $tag[1] . $attributes . '>');
        }
    }
    // Finally, this is not an allowed tag so strip all the attributes and escape it
    else {
        return htmlspecialchars('<' . $tag[1] . '>');
    }
}

/**
 * Parses HTML Code & Cleans it
 * This approach is quite aggressive and anything that does not look like a valid tag
 * is going to get converted to HTML entities
 */
function parsehtml($text) {
    $text = stripslashes($text);
    $html_match = '#<[^\w<]*(\w+)((?:"[^"]*"|\'[^\']*\'|[^<>\'"])+)?>#';
    $matches = array();

    // Split the text around the tags, and collect the tags themselves separately
    $text_split = preg_split($html_match, $text);
    preg_match_all($html_match, $text, $matches);

    $text = '';

    // Re-assemble: escaped text part, then the cleaned tag that followed it
    foreach ($text_split as $part)
    {
        $tag = array(array_shift($matches[0]), array_shift($matches[1]), array_shift($matches[2]));
        $text .= htmlspecialchars($part) . clean_html($tag);
    }

    $text = addslashes($text);
    return $text;
}
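
One direction I'm considering for the regex problems is replacing the /e patterns with preg_replace_callback(), since /e evaluates the replacement string as PHP code (it was later deprecated in PHP 5.5 and removed in PHP 7). Below is a minimal sketch for the url tag only, assuming PHP 5.3+. build_link() is a hypothetical stand-in for my checkurl() function, and the sketch assumes it receives raw, not yet escaped, text.

```php
<?php
// Hypothetical stand-in for checkurl(): builds a link or falls back to plain text.
function build_link($url, $label = null)
{
    if ($label === null || $label === '') {
        $label = $url;
    }
    // Only allow http(s) URLs; anything else is returned as escaped plain text.
    if (!preg_match('#^https?://#i', $url)) {
        return htmlspecialchars($label, ENT_QUOTES, 'UTF-8');
    }
    return '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
         . htmlspecialchars($label, ENT_QUOTES, 'UTF-8') . '</a>';
}

function parse_url_tags($text)
{
    // [url=http://example.com]label[/url], with optional quotes around the URL
    $text = preg_replace_callback(
        '/\[url=([\'"]?)([^\'"\]]*)\1\](.*?)\[\/url\]/si',
        function ($m) { return build_link($m[2], $m[3]); },
        $text
    );
    // [url]http://example.com[/url]
    return preg_replace_callback(
        '/\[url\](.*?)\[\/url\]/si',
        function ($m) { return build_link($m[1]); },
        $text
    );
}

echo parse_url_tags('[url=http://example.com]Example[/url]'), "\n";
```

The callback receives the match groups as an array, so the numbered groups only need to capture the parts that are actually used, which should make the patterns shorter than the ones above.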


I know this is quite difficult, and more than just a quick fix, but any help is greatly appreciated.