Encoding with Datafeeds Explained

Originally Published: 2013-08-15

Article Number

000065684

Issue

My text is not formatted as I expect when importing content into Archer using a datafeed.
How do I properly format my HTML in data being imported using datafeeds?
What do I do if I don't have control over how my content is being encoded and imported to Archer via a datafeed?
5.x

Resolution

All data that comes into Archer is sanitized to ensure that HTML security vulnerabilities (such as cross-site scripting or injected JavaScript) are excluded from the content. This sanitization takes place whether the data comes into the system through a data feed or through the UI. There is one difference between the two, however: the UI knows which parts of the content are markup and which are not, whereas Data Feed has no way to make that determination.

Let's expand on that idea, using the string "<b>Text is bold</b>" as an example. This string can be entered in the UI in three different ways: it can be typed directly into the Text Area field exactly as shown here; it can be entered into the HTML Editor exactly as shown here; or the value "Text is bold" can be typed into the Text Area and highlighted, and then the Bold button in the toolbar pressed so that the font changes to bold. If the information is entered in the first manner, the <b> and </b> are considered part of the content, and the browser displays the string literally as "<b>Text is bold</b>". If the string is entered in one of the last two ways, the <b> and </b> are considered markup, and the string is displayed as "Text is bold" in bold type.

In the above example, if the user entered the string in the Text Area as in the first case, it would actually be stored in the database as "&lt;b&gt;Text is bold&lt;/b&gt;". If the user entered the string in one of the last two ways, it would be stored in the database as "<b>Text is bold</b>". This is how Archer can determine whether content is real content or markup: when the string is actual content, Archer escapes any characters that could be considered markup characters. This escaping (also known as encoding) is not something invented by Archer; HTML encoding is a universal standard. In fact, Archer does nothing special to this information when displaying it: it pulls the information from the database and renders it to the browser exactly as stored. The browser understands the encoding and automatically decodes the string so that it looks correct.
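The two stored forms can be illustrated outside Archer. The following is a minimal Python sketch using the standard library's html module. It is not Archer's implementation, and the variable names are purely illustrative.

```python
import html

typed_text = "<b>Text is bold</b>"

# Content case: markup characters are escaped before storage,
# so the tags survive as literal text.
stored_as_content = html.escape(typed_text, quote=False)
print(stored_as_content)  # &lt;b&gt;Text is bold&lt;/b&gt;

# Markup case: the tags are stored unchanged and act as formatting.
stored_as_markup = typed_text

# A browser decodes entities when it renders a page;
# html.unescape mimics that decoding step.
print(html.unescape(stored_as_content))  # <b>Text is bold</b>
```

The encoded form round-trips back to the literal text, which is why a user who types the tags as content sees them again as typed.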

Now that we understand how the UI works, how can the same thing be accomplished in Data Feed? The reality is that Data Feed has no way of identifying whether information is content or markup, so the only way to distinguish between them is for the client to provide the proper encoding in the source file. If the content should literally read "<b>Text is bold</b>", the client must provide it encoded in the source file (like this: "&lt;b&gt;Text is bold&lt;/b&gt;"). Data Feed stores content exactly as it appears in the source file. If the string were not encoded, it would be stored in the database as "<b>Text is bold</b>" and rendered in the browser as "Text is bold" in bold type.

Well, you say, that is all well and good, but why do I have to encode content like "Is 28 really < 35?" when it is clear to me that the "<" sign is not markup? This is where the sanitization comes in. As I stated at the beginning, Archer sanitizes all content. Part of this sanitization process "cleans up" improperly formatted HTML. This is necessary because most browsers attempt to display HTML even when it is malformed, and that helpfulness could also allow vulnerabilities to pass through. So our sanitizer attempts to fix up the markup before sanitizing it, to ensure vulnerabilities do not slip through. In this example the sanitizer sees an opening markup tag with no closing markup tag, so it creates a closing tag automatically (and adds a few more things to produce what it considers valid markup). Therefore, any characters that could be interpreted as markup must be encoded whenever they are not intended as markup.
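When you do control the source file, the encoding can be applied before the data feed reads it. The sketch below assumes a CSV source with a column named question. Both the file format and the names (encode_rows, question) are hypothetical and would need adapting to the actual feed.

```python
import csv
import html
import io

def encode_rows(source_text, content_columns):
    """Return CSV text with the named columns HTML-encoded (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in content_columns:
            # html.escape encodes &, <, > and both quote characters
            # (Python writes the apostrophe as the numeric entity &#x27;).
            row[col] = html.escape(row[col])
        writer.writerow(row)
    return out.getvalue()

source = "id,question\n1,Is 28 really < 35?\n"
encoded = encode_rows(source, ["question"])
print(encoded)
```

Only columns holding literal content should be listed. A column that intentionally carries markup would be left out so that its tags stay active.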

In some cases you may not have control over the source file; for example, it may come from a third-party web site. In these cases a calculated field can be created using a function called HtmlEncode([fieldname]), where fieldname is the name of the field needing to be encoded. The target application field can then be mapped to the calculated field, and the value will be properly encoded before it is stored in the database.

Here is a list of the characters that need to be encoded:

Character	Encoded Value	Description
"	&quot;	double quotation mark
&	&amp;	ampersand
'	&apos;	apostrophe
<	&lt;	less-than sign
>	&gt;	greater-than sign

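The five characters listed above can be encoded with a simple lookup table. This is a minimal Python sketch and not Archer's HtmlEncode implementation. The helper name encode_for_feed is hypothetical, and the apostrophe uses the standard entity name &apos;.

```python
# The five markup characters mapped to their HTML entities.
ENTITY_TABLE = str.maketrans({
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    "<": "&lt;",
    ">": "&gt;",
})

def encode_for_feed(value: str) -> str:
    """Encode markup characters in one pass (hypothetical helper)."""
    # str.translate visits each input character once, so the ampersands
    # inside the generated entities are never encoded a second time.
    return value.translate(ENTITY_TABLE)

print(encode_for_feed('Tom & "Jerry" <b>it\'s</b>'))
```

A single-pass translation avoids the classic ordering bug of chained replacements, where encoding the ampersand last would corrupt entities produced by earlier steps.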