Context-aware HTML escaping

This article was published as a Month of PHP Security submission.

Introduction

Cross-site scripting (XSS) is one of the most common vulnerabilities in web applications. Defending HTML pages against it is nevertheless quite simple: just before outputting any untrusted data, convert the characters < and &, which have a special meaning in HTML text, to the corresponding entities &lt; and &amp;. If we output untrusted data inside a quoted HTML attribute value (such as title=""), we also have to escape the quote character as &quot;.

PHP provides this escaping through the htmlspecialchars function, which encodes all three special characters (and > as a bonus). Passing ENT_QUOTES as the second parameter also escapes ', which can be used to quote HTML attribute values as well.
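As a minimal sketch of the call described above, assuming PHP 7 or later and UTF-8 input: the helper name escapeHtml and the sample value are only illustrative, they are not part of PHP.

```php
<?php
// Hypothetical helper name; a thin wrapper so the flags and charset are never forgotten.
function escapeHtml(string $value): string
{
    // ENT_QUOTES also converts ' to an entity; the charset is passed explicitly.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Illustrative untrusted input containing every special character.
$comment = '<b onclick="x">Tom & "Jerry\'s"</b>';

// The same escaped value is usable in HTML text and in a quoted attribute value.
echo '<p title="', escapeHtml($comment), '">', escapeHtml($comment), "</p>\n";
```

Wrapping the call keeps the safe variant short, so there is less temptation to print the raw value.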

Note that this function is safe only for escaping HTML text (data between tags that have no special meaning, unlike <script> or <style>) and quoted values of attributes that have no special meaning (unlike onmouseover="" or style=""). Other contexts, such as tag or attribute names, unquoted attribute values, or HTML comments, remain unsafe even after this function is applied.
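To make the unquoted attribute case concrete, here is a small sketch. The payload is an illustrative assumption; it contains none of the characters that htmlspecialchars converts, so escaping leaves it untouched.

```php
<?php
// Illustrative attacker-controlled value without any HTML special characters.
$input = 'x onmouseover=alert(1)';

$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Nothing to convert, so the escaped value equals the input.
var_dump($escaped === $input); // bool(true)

// Unsafe: in an unquoted attribute value the space starts a new attribute.
echo "<img alt=$escaped>\n";

// Safe: the same value inside a quoted attribute without a special meaning.
echo "<img alt=\"$escaped\">\n";
```

The first tag ends up with an onmouseover attribute supplied by the attacker, the second one does not.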

The page must also set its character set explicitly in the Content-Type header. Otherwise the browser can be fooled into using a character set in which other characters have a special meaning (such as UTF-7). The character set can also be passed to htmlspecialchars, but this is not required with single-byte character sets or UTF-8.
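A short sketch of both steps, assuming the page and its data are UTF-8; the sample value is illustrative.

```php
<?php
// Declare the character set before any output so the browser does not have to guess.
header('Content-Type: text/html; charset=utf-8');

// Illustrative untrusted value, UTF-8 encoded.
$name = 'Jakub <Vrána>';

// The third argument tells htmlspecialchars which encoding the input uses.
$html = '<p>' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . "</p>\n";
echo $html;
```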

Note: The htmlentities function can be used with legacy encodings to encode characters that do not exist in the character set. It is not needed with Unicode, which covers all characters.

Automatic escaping

If the defense against XSS is so simple (apply htmlspecialchars to all printed data), why is it such a common vulnerability? The reason is that programmers often forget to use it. Sometimes they use it on the usual XSS targets such as discussions but forget it in a search or registration form. Other times they escape all form fields but forget about URL parameters.

The best way not to forget about escaping is to automate it. Most templating engines offer automatic escaping. For example, Smarty offers the $default_modifiers variable, which can add an escape filter to all printed data. These default modifiers can be disabled by applying the smarty:nodefaults filter to a variable, so it is still possible to output trusted HTML code, but doing so requires longer code. This is an important observation: the shorter code is the more secure one.
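The principle that the short form should be the safe one can be illustrated without any templating engine. The following is only a sketch of the idea, not Smarty's API; the class TemplateVars and its raw method are hypothetical names.

```php
<?php
// Hypothetical wrapper: escaping is the default, raw output needs the longer call.
final class TemplateVars
{
    /** @var array<string, string> */
    private $values;

    public function __construct(array $values)
    {
        $this->values = $values;
    }

    // Short form: $v->name is always escaped for HTML text and quoted attributes.
    public function __get(string $name): string
    {
        return htmlspecialchars((string) $this->values[$name], ENT_QUOTES, 'UTF-8');
    }

    // Long form, intended for trusted HTML only.
    public function raw(string $name): string
    {
        return (string) $this->values[$name];
    }
}

$v = new TemplateVars([
    'query'  => '<script>alert(1)</script>', // untrusted
    'banner' => '<em>Welcome</em>',          // trusted HTML
]);

echo '<p>Results for ', $v->query, "</p>\n";
echo $v->raw('banner'), "\n";
```

Forgetting something here means printing the escaped value, which is a cosmetic bug rather than a security hole.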

I see automatic escaping as one of the most important features of templating engines (another is the separation of HTML and PHP code). Pure PHP templates do not offer this feature.

Note: Another option is to generate XML data in the PHP script through DOM or another PHP extension and convert it to HTML with an XSL template. This approach is equivalent to automatic escaping because text content created by, for example, createTextNode is serialized with special characters converted to entities. Creating such applications is, however, more difficult than using classic templates, and generating the page requires more resources.
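The serialization behaviour mentioned in the note can be observed with a few lines of DOM code. This sketch assumes the DOM extension is available and leaves out the XSL step; the element and the text are illustrative.

```php
<?php
// Requires the DOM extension, which is enabled in most PHP builds.
$doc = new DOMDocument('1.0', 'UTF-8');

$p = $doc->createElement('p');
// The text node holds raw data; entities are produced only on serialization.
$p->appendChild($doc->createTextNode('1 < 2 & 3 > 2'));
$doc->appendChild($p);

// Serialize just the <p> element.
$xml = $doc->saveXML($p);
echo $xml, "\n";
```

The script never calls an escaping function, yet the serialized text contains &lt; and &amp; instead of the raw characters.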

Context-aware escaping

There are still other contexts with different sets of special characters, most importantly the <script> tag and JavaScript event handlers. Such code can usually be moved to an external file, but not always (for example, a user-specific JavaScript variable is better initialized in an inline <script> tag). More importantly, the decision about where the data is used is up to the template author, who may decide to use data escaped for HTML inside a JavaScript event handler. An important part of our application's security is thus in the hands of an HTML coder! This is often someone with a brilliant sense of color who nevertheless barely understands loops.

The solution to this problem is context-aware escaping, which extends automatic escaping so that it recognizes the context and chooses the escaping function appropriate for it.
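What "choosing the escaping function for the context" means can be sketched by hand. In a real engine the context is detected by parsing the template; here it is passed explicitly. The function escapeFor and its context names are hypothetical, and the comment rule in particular is a simplification.

```php
<?php
// Hypothetical function; the context names are illustrative, not any engine's internals.
function escapeFor(string $context, $value): string
{
    switch ($context) {
        case 'html': // HTML text or a quoted attribute value
            return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
        case 'js': // inside a <script> block
            // JSON is valid JavaScript literal syntax; the HEX flags keep
            // sequences such as </script> away from the HTML parser.
            return json_encode($value, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);
        case 'jsAttr': // JavaScript inside an attribute such as onclick=""
            // Two layers: first a JS literal, then escaping for the attribute.
            return htmlspecialchars(escapeFor('js', $value), ENT_QUOTES, 'UTF-8');
        case 'comment': // inside <!-- -->
            // Break up "--" so the value cannot close the comment.
            return str_replace('-', '- ', (string) $value);
        default:
            throw new InvalidArgumentException("Unknown context: $context");
    }
}

// Illustrative untrusted value that is dangerous in several contexts.
$message = 'Delete "</script>" really?';

echo '<script>var m = ', escapeFor('js', $message), ";</script>\n";
echo '<a href="" onclick="return !confirm(', escapeFor('jsAttr', $message), ');">',
    escapeFor('html', $message), "</a>\n";
echo '<!-- ', escapeFor('comment', $message), " -->\n";
```

The same value is printed three different ways, and using the 'html' variant inside the event handler would be exactly the mistake described above.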

The first templating engine with context-aware escaping is probably Google's ctemplate, which is available for C++. The only context-aware escaping templating engine for PHP known to the author is Nette Latte, which is part of the Nette Framework but can also be used independently.

Note: The Nette Framework is built with an emphasis on security, which is visible not only in templates but in all other parts as well. For example, defending against cross-site request forgery in the framework's forms is easy.

Nette Latte

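The Nette Latte templating engine automatically recognizes the context in which each variable is printed, such as HTML text, a quoted attribute value, a style attribute, a <script> block, a JavaScript event handler, or an HTML comment.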

This allows writing even complicated (but still realistic) code without any manual escaping:

<script type="text/javascript">
var userId = {$userId};
</script>
<p style="color: {$color};" title="{$title}">
<a href="" onclick="return !confirm({$message});">{$desc}</a>
</p>
<!-- Executed in: {$time} s -->

If you try to escape this code by hand without any restrictions on the variable values then you will probably find it very difficult.

Please note that variables used inside the JavaScript code are not quoted. Treat them as ordinary JS values: PHP numbers are printed as JS numbers, PHP strings as JS strings, PHP associative arrays as JS object literals, and so on.
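The mapping from PHP values to JavaScript literals can be reproduced in plain PHP with json_encode. This is an assumption used for illustration; it is not a statement about how Latte produces its output, and the variable values are made up.

```php
<?php
// The HEX flags keep characters that matter to the HTML parser out of the output.
$flags = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT;

$userId = 42;                                // PHP integer
$name   = "O'Reilly";                        // PHP string
$prefs  = ['theme' => 'dark', 'size' => 3];  // PHP associative array

echo "<script>\n";
echo 'var userId = ', json_encode($userId, $flags), ";\n"; // JS number
echo 'var name = ', json_encode($name, $flags), ";\n";     // JS string, quotes included
echo 'var prefs = ', json_encode($prefs, $flags), ";\n";   // JS object literal
echo "</script>\n";
```

The quotes around strings are part of the generated literal, which is why the template itself does not add any.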

Automatic context-aware escaping can't be disabled by some magic filter (as in Smarty), but there is a separate syntax for printing a raw variable value: {!$var}. Again, less code means more security, and the exclamation mark also signals something possibly dangerous.

Note: There is no special context for URLs. The reason is that links are created with a separate tag {link}.

Summary

Escaping HTML special characters is simple, but it is easily forgotten. Automatic escaping solves this problem but doesn't respect contexts with different special characters, such as JavaScript. Context-aware escaping comes to the rescue. Nette Latte is a solid templating engine for PHP with this feature.

Jakub Vrána, Teaching, 5.5.2010, comments: 3

Comments

Martin:

Interesting, but what effect does this have on performance? If every template goes through syntactic analysis, it probably won't be great.

jarda:

That is done only when the template is translated (compiled); after that, the result is already prepared in the cache.

analytik:

"Like have" should be "e.g." in English, or "for example,"
