Community:NDR/Encoding

From NSDLWiki

NDR API Documentation

Encoding

There are three different layers of encoding that must all be correct in NDR API requests: character encoding (UTF-8), XML entities, and URL escaping. When a request has an XML parameter such as inputXML, then invalid UTF-8 byte sequences, unrecognized XML entities, invalid XML or un-escaped URL characters can make it impossible to process a request, or worse, can cause a request to be processed with incorrect information.
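As a rough illustration of how the three layers stack, the sketch below builds an inputXML query parameter in Python. The function name and the bare identifier element are placeholders invented for this example, not the real NDR request schema; only the order of the steps is the point.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def build_input_xml_parameter(identifier_url: str) -> str:
    """Hypothetical helper: wrap a URL in placeholder XML and encode it
    for use as an inputXML query parameter."""
    # Layer 2 (XML entities): escape &, < and > in the character data.
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<identifier>" + escape(identifier_url) + "</identifier>"
    )
    # Layers 1 and 3: quote() encodes the string as UTF-8 by default,
    # then percent-escapes every byte that is not an unreserved character.
    return "inputXML=" + quote(xml, safe="")


print(build_input_xml_parameter("http://fake.example.com?arg1=5&arg2=8"))
```

The ampersand is first turned into an XML entity and only then URL escaped, so it travels as %26amp%3B and is decoded in the reverse order on the receiving side.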

UTF-8 character encoding

UTF-8 is a Unicode character encoding that represents each ASCII character as a single byte and uses up to four bytes for other Unicode characters. Unicode characters outside the simple ASCII range (32 (space) through 127 (delete)) must be encoded either as multi-byte UTF-8 sequences or using numerical entities (see XML entities below).

UTF-8 is the default encoding for XML, and most modern XML authoring tools automatically encode new XML instances in UTF-8. However, this is not always the case: it is not uncommon for software that is configured for a different encoding to write diacritics, symbols, and the like in that encoding instead. So always be sure that the XML encoding is set to UTF-8.

Not only must all character data be correctly UTF-8 encoded, but the UTF-8 character encoding must be correctly specified in the XML declaration, which is the first line of any XML document: <?xml version="1.0" encoding="UTF-8"?>
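One way to check both requirements before sending a request is sketched below in Python. The helper names are made up for this example; the check relies only on the standard bytes.decode method.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def with_utf8_declaration(xml_text: str) -> bytes:
    """Prefix the XML declaration if it is missing, then encode as UTF-8."""
    declaration = '<?xml version="1.0" encoding="UTF-8"?>'
    if not xml_text.lstrip().startswith("<?xml"):
        xml_text = declaration + "\n" + xml_text
    return xml_text.encode("utf-8")


# The same word encoded as Latin-1 is not valid UTF-8.
print(is_valid_utf8("caf\u00e9".encode("utf-8")))    # True
print(is_valid_utf8("caf\u00e9".encode("latin-1")))  # False
```

Note that this sketch does not rewrite an existing declaration that names a different encoding; that case would need separate handling.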

For more information on UTF-8 encoding, see:

XML entities

There are five characters that are markup delimiters in XML, and therefore can never appear in their literal form in XML character data (such as the text value of an element). If these characters are needed as literals, the following named entities MUST be used:

  • &amp; for & (ampersand)
  • &lt; for < (left angle bracket, less-than sign)
  • &gt; for > (right angle bracket, greater-than sign)
  • &quot; for " (quotation mark)
  • &apos; for ' (apostrophe)

As with any XML entity reference, these begin with an ampersand and end with a semicolon. This is why ampersands themselves must be represented with the entity &amp;: otherwise they are interpreted as the beginning of an entity.

Note that this is true even when the characters are part of a URL within an XML element. For example,

<dc:identifier>http://fake.example.com?arg1=5&arg2=8</dc:identifier>
MUST be changed to
<dc:identifier>http://fake.example.com?arg1=5&amp;arg2=8</dc:identifier>
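In Python, the standard library can apply the five named entities; a minimal sketch follows. The wrapper name xml_escape is chosen for this example. xml.sax.saxutils.escape handles &, < and > itself, and the extra mapping adds the two quote characters.

```python
from xml.sax.saxutils import escape


def xml_escape(text: str) -> str:
    """Replace the five XML markup delimiters with their named entities."""
    # escape() converts & first, so the entities added for the quote
    # characters are not escaped a second time.
    return escape(text, {'"': "&quot;", "'": "&apos;"})


print(xml_escape("http://fake.example.com?arg1=5&arg2=8"))
# http://fake.example.com?arg1=5&amp;arg2=8
```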

No other named entities are allowed in XML arguments. For example, the a-acute character (á) may not be represented as &aacute; because XML Schema does not allow DTD-style entities.

Numerical entities are allowed: the a-acute character could be represented as the hexadecimal numerical entity &#xE1;. Another option is to use the correct UTF-8 character representation in the XML.
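If existing text already contains HTML-style named entities, one possible cleanup is to rewrite them as hexadecimal numerical entities. The Python sketch below assumes the HTML 4 entity names from the standard html.entities module; the function names are invented for this example.

```python
import re
from html.entities import name2codepoint

# The five entities that XML itself predefines are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(text: str) -> str:
    """Rewrite named entities such as &aacute; as hexadecimal numerical entities."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown names as-is
        return f"&#x{name2codepoint[name]:X};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal numerical entity."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


print(named_to_numeric("caf&aacute; &amp; more"))  # caf&#xE1; &amp; more
print(to_numeric_entities("\u00e1"))                # &#xE1;
```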

Sometimes text is XML-encoded by a program more than once. Because the ampersand character "&" begins every XML entity, doubly encoded XML contains undesirable sequences such as "&amp;amp;", "&amp;gt;", or "&amp;#xE1;". These should be corrected (to "&amp;", "&gt;", and "&#xE1;", respectively).
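A simple way to remove one layer of double encoding is a regular expression that looks for "&amp;" followed by the remainder of an entity. This Python sketch is an assumption about how such a cleanup might look, not a tool provided by the NDR. It cannot tell accidental double encoding from text that really means the literal string "&amp;", so it should only be applied to data known to be doubly encoded.

```python
import re

# "&amp;" followed by the rest of a predefined or numerical entity.
_DOUBLE_ENCODED = re.compile(r"&amp;(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);")


def undo_double_encoding(text: str) -> str:
    """Strip one extra level of XML encoding from entity references."""
    return _DOUBLE_ENCODED.sub(r"&\1;", text)


print(undo_double_encoding("&amp;amp; &amp;gt; &amp;#xE1;"))
# &amp; &gt; &#xE1;
```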

For more information on XML character entities, see

URL encoding

Certain characters in URLs have special meaning in certain contexts, such as "?" at the start of the query arguments or ":" just after the scheme (e.g. "http:"). When the special meaning is not desired, these characters need to be encoded as "%xx", where xx is the character's ASCII value in hexadecimal. For example, the space character corresponds to 32 in ASCII, which is 20 in hexadecimal. So a space would be represented as "%20" within a URL:

<dc:identifier>http://fake.example.com?arg1=NSDL%20rules</dc:identifier>
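In Python, urllib.parse.quote performs this percent-encoding. The helper below is a hypothetical convenience wrapper for this example; it encodes only the value, so the "?" and "=" that act as delimiters are left alone.

```python
from urllib.parse import quote


def add_query_value(base_url: str, name: str, value: str) -> str:
    """Append one name=value pair, percent-encoding the value only."""
    # safe="" means no reserved character is exempt from encoding.
    return f"{base_url}?{name}={quote(value, safe='')}"


print(add_query_value("http://fake.example.com", "arg1", "NSDL rules"))
# http://fake.example.com?arg1=NSDL%20rules
```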

Which exact characters must be URL encoded is a tricky question, as different software may process URLs differently. For example, some XML parsers, such as Xerces (http://xerces.apache.org/), will give an error when parsing a URI containing a space, even though many web browsers will interpret the space without URL encoding.

You MUST encode the following characters within URLs (when they are not to be interpreted as URL delimiters):

characters that have special meaning to URL parsers
    • Colon (":") --> %3A
    • Forward slash/Virgule ("/") --> %2F
    • Question mark ("?") --> %3F
    • Equals ("=") --> %3D
    • Ampersand ("&") --> %26
    • 'Pound' character ("#") --> %23
    • Percent character ("%") --> %25
additional characters that MUST be URL encoded per the URI spec
    • Left Square Bracket ("[") --> %5B
    • Right Square Bracket ("]") --> %5D
    • 'At' symbol ("@") --> %40
    • Plus ("+") --> %2B (but if this is to be interpreted as a space, then it should be %20, which is the encoding for " ")
    • Exclamation Point ("!") --> %21
    • Dollar ("$") --> %24
    • Asterisk ("*") --> %2A
    • Apostrophe ("'") --> %27
    • Left Parenthesis ("(") --> %28
    • Right Parenthesis (")") --> %29
    • Semi-colon (";") --> %3B
    • Comma (",") --> %2C

We strongly recommend encoding the following characters within URLs as well (due to now obsolete URL specs):

    • Space (" ") --> %20
    • Quotation mark ('"') --> %22
    • Caret ("^") --> %5E
    • Backslash ("\") --> %5C
    • Grave Accent ("`") --> %60
    • 'Less Than' symbol ("<") --> %3C
    • 'Greater Than' symbol (">") --> %3E
    • Left Curly Brace ("{") --> %7B
    • Right Curly Brace ("}") --> %7D
    • Vertical Bar/Pipe ("|") --> %7C
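The lists above can be checked against Python's urllib.parse.quote, as in the sketch below. Passing safe="" matters: by default quote() leaves "/" unencoded. The constant and function names are chosen for this example only.

```python
from urllib.parse import quote, unquote, unquote_plus

# Every character named in the lists above.
LISTED = ":/?=&#%[]@+!$*'();, \"^\\`<>{}|"


def percent_encode(value: str) -> str:
    """Percent-encode everything except letters, digits and "_.-~"."""
    return quote(value, safe="")


for ch in LISTED:
    print(repr(ch), "-->", percent_encode(ch))

print(quote("a/b"))           # a/b    (default keeps the slash)
print(percent_encode("a/b"))  # a%2Fb

# A literal plus survives only when it is sent as %2B.
print(unquote_plus("a+b"))    # a b
print(unquote_plus("a%2Bb"))  # a+b
print(unquote("a+b"))         # a+b
```

Whether a receiver treats "+" as a space depends on which decoding convention it uses, which is why sending %2B for a literal plus and %20 for a space is the unambiguous choice.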

For more information on URL encoding, see:

More information on character encoding issues
