Mastering Context-Aware Output Encoding
Output encoding transforms potentially dangerous characters into safe representations that browsers will display as text rather than interpret as code. The critical insight is that different contexts require different encoding schemes. Using HTML encoding in a JavaScript context or URL encoding in an HTML attribute can leave your application vulnerable despite your encoding efforts. Understanding these contexts and applying appropriate encoding is fundamental to XSS prevention.
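To make the context mismatch concrete, the following minimal sketch HTML-encodes a value and places it inside a JavaScript string within an inline event-handler attribute. It assumes a hypothetical handler named show, and it uses Python's html.unescape as a stand-in for the character-reference decoding a browser performs on attribute values before the script engine sees them. It is an approximation of browser behavior, not a model of a real parser.

```python
import html


def js_seen_by_browser(untrusted: str) -> str:
    """Return the script text a browser would roughly hand to its JS engine.

    Assumed template (hypothetical): <button onclick="show('VALUE')">
    """
    # Step 1: the developer applies HTML encoding only.
    encoded = html.escape(untrusted, quote=True)
    # Step 2: the encoded value is placed inside a JS string in the attribute.
    attribute_value = f"show('{encoded}')"
    # Step 3: the HTML parser decodes character references in the attribute
    # value before the JavaScript is compiled (approximated with unescape).
    return html.unescape(attribute_value)


if __name__ == "__main__":
    print(js_seen_by_browser("'); alert(1); //"))
```

Because the HTML layer is decoded first, the encoded quote turns back into a literal quote and closes the string early. The value needed JavaScript string encoding for the inner context, with HTML attribute encoding applied on top for the outer one.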
HTML context encoding is the most common requirement, used when inserting untrusted data between HTML tags. The essential characters to encode are less than (<) to &lt;, greater than (>) to &gt;, ampersand (&) to &amp;, double quote (") to &quot;, and single quote (') to &#x27;. However, encoding requirements become more complex in attribute contexts. When inserting data into quoted attributes, you must encode the quote character used for that attribute. Unquoted attributes require encoding virtually all non-alphanumeric characters because spaces, equals signs, and other characters can break out of the attribute context.
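The two cases described above can be sketched in Python. The first function relies on the standard library's html.escape. The second, encode_html_attribute_strict, is a hypothetical helper written for this illustration. It takes the conservative approach of turning every character other than ASCII letters and digits into a numeric character reference, which is one reasonable way to handle unquoted attributes, not the only one.

```python
import html


def encode_html_text(value: str) -> str:
    """Encode untrusted text for use between tags or in a quoted attribute."""
    # With quote=True, html.escape replaces &, <, >, " and '.
    return html.escape(value, quote=True)


def encode_html_attribute_strict(value: str) -> str:
    """Hypothetical helper: encode everything except ASCII letters and digits.

    Intended for unquoted attribute values, where spaces, equals signs and
    similar characters could otherwise end the attribute.
    """
    return "".join(
        ch if (ch.isascii() and ch.isalnum()) else f"&#x{ord(ch):x};"
        for ch in value
    )


if __name__ == "__main__":
    print(encode_html_text('<script>alert("x")</script>'))
    print(encode_html_attribute_strict("a b=c"))
```

Quoting every attribute in the template is still the simpler choice where possible, since it reduces the breakout characters to the quote itself.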
JavaScript encoding presents unique challenges because data often passes through multiple interpretation layers. When server-side code generates JavaScript containing user data, that data needs JavaScript-specific encoding: backslashes become \\, quotes become \' or \", newlines become \n, and other control characters need appropriate escape sequences. If that JavaScript then inserts data into the DOM, HTML encoding is also required. This double-encoding requirement frequently causes vulnerabilities when developers apply only one encoding layer.
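A minimal sketch of the two layers, assuming the value ends up inside a quoted JavaScript string literal that server-side code generates. The function names are hypothetical. Beyond the escapes listed above, this sketch also writes <, >, &, U+2028 and U+2029 as \uXXXX sequences. That is an extra precaution so that a value cannot close an enclosing script element; it is an assumption of the sketch, not something the text above requires.

```python
import html

# Escapes named in the text: backslash, both quote styles, common whitespace.
_JS_ESCAPES = {
    "\\": "\\\\",
    "'": "\\'",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}

# Extra characters written as \uXXXX (an assumption of this sketch).
_JS_UNICODE_ESCAPED = {"<", ">", "&", "\u2028", "\u2029"}


def encode_js_string(value: str) -> str:
    """Encode untrusted text for the inside of a quoted JS string literal."""
    out = []
    for ch in value:
        if ch in _JS_ESCAPES:
            out.append(_JS_ESCAPES[ch])
        elif ord(ch) < 0x20 or ch in _JS_UNICODE_ESCAPED:
            # Remaining control characters and the extra set above.
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    return "".join(out)


def encode_for_js_then_dom(value: str) -> str:
    """Apply both layers for a JS string that the script later writes as HTML."""
    # HTML-encode first, so the string the script ends up holding is safe as
    # markup; then JS-encode, so the literal itself cannot be broken out of.
    return encode_js_string(html.escape(value, quote=True))


if __name__ == "__main__":
    print(encode_js_string("it's a \"test\"\n"))
    print(encode_for_js_then_dom("<b>"))
```

The order matters because the browser undoes the layers in reverse. The JavaScript parser removes the string escapes first, and the HTML parser then sees the entity-encoded text when the script writes it into the DOM.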