Extensible Markup Language (XML) in computing terms is a type of markup language where a set of rules are defined for encoding documents into a format that can be read by both machines as well as humans. It is defined by several free open standards, including the W3C XML 1.0 Specifications along with a number of other related specifications.
The design goals focus on usability, generality and simplicity over the Internet. It is a type of textual data format that is strongly supported by Unicode for various human languages. Although the focus of design is on documents, the language is also used widely for representing arbitrary structures like those that are used for web services.
There are different scheme stems in existence to help with defining XML-based language. Numerous application programming interfaces (APIs) have also been developed by programmers to help with processing data.
XML Applications – big eCommerce sites like shopify and amazon use it for developing
One hundred document formats that use syntax have been developed as of 2009, including XHTML, SOAP, Atom and RSS. For numerous office-productivity tools, including Apple’s iWork, LibreOffice (OpenDocument) and Microsoft Office (Office Open XML), XML-based formats have become the default. In addition, it has provided the base language for XMPP and other communication protocols. Files are used for configuration purposes for Microsoft .NET Framework applications. Apple also has a registry implementation that is based on it. Amazon also runs a registry based. Another site that implements it is the ecom success academy site. It is kind of hard to find a site that does not…
It is now commonly used for interchanging data over the Internet. Rules for constructing Internet Media Types are given by IETF RFC 7303 to use when XML is sent. The media types text/xml and application/sml are also defined by it, which only say that the data is in XML format, but doesn’t state anything about the semantics. Some have been critical of using text/xml as a possible source of encoding problems. Suggestions have been made to deprecate it.
It is also recommended by RFC 7303 that XML-based languages should be provided with media types that end in +xml. For instance for SVG, image/svg+xml.
The RFC 3470 has additional guidelines for using it within a network context. The document is also called IETF BCP 70 and covers many aspects related to the design and deployment of an language.
The material below is based on the Specification. It isn’t an exhaustive list of every construct appearing in it. Rather it introduces the key constructs that are encountered most often in daily use.
A string of characters is what an document is comprised of. It can contain nearly every legal Unicode character.
Processor and application
The markup is analyzed by the processor and then structured information is passed on to an application. Requirements are placed by the specification on the things that an XML processor must do as well as don’t do. However, the application is outside of the specification’s scope. The processor (as referred to by the specification) is often called an XML parser colloquially.
Markup and content
An XML document is made up of characters and these are divided into content and markup. Simple syntactic rules are applied to distinguish them. In general, strings constituting markup with either start with the character & and then end with ; or start with the character < and then end with the character >. Any strings of characters that aren’t markup are content. In a CDATA section, however, certain delimiters are considered to be markup, and then the text that is in between them is considered to be content. Whitespace that comes before and then after the outermost element is also considered to be markup.
A tag is a kind of markup construct that starts with < and then ends with >. There are three types of tags:
empty-element tag, like < line-break / >
start-tag, like < section >
end-tag, like < /section >
An element refers to a logical document component. It either consists of just an empty-element tag or starts with a start-tag and then ends with the matching end-tag. If there are any characters that are in between the start-tag and the end-tag, they are classified as the element’s content. They might contain markup, which can include other elements that are referred to as child elements.
An attribute is a type of markup construct that consists of a name-value pair that exists inside an empty-element tag or start-tag.
An XML attribute may have only one value and every attribute may appear once at the most on each element. Multiple values are commonly desired. In this situation, it has to be done through encoding the list into a very well-formed attribute using a format that goes beyond what uses for defining itself. This is usually either a semi-colon or comma delimited list, or a space-delimited list may be used when the individual values do not contain spaces.
An document can start with an XML declaration describing some information about itself.
Characters and escaping
An document is completely made up of Unicode characters. Any character that Unicode defines can appears within an document’s content, with the exception of a few control characters, which are specifically excluded.
Facilities are included in XML for identifying Unicode character encoding that make up a document, as well as for expressing characters that cannot be directly used for some reason.
The following ranges of Unicode code points are valid for 1.0 documents:
U+10000–U+10FFFF: includes all of the code points within supplementary planes, which includes non-characters;
U+0020–U+D7FF, U+E000–U+FFFD: excludes some non-characters (but not all) in the BMP (U+FFFE and U+FFFF, all surrogates, are forbidden);
U+000D (Carriage Return), U+000A (Line Feed), U+0009 (Horizontal Tab): the only CO controls that the 1.0 accepts.
The allowed character set is extended by XML 1.1 to include the remaining characters that are within the U+0001–U+001F range in addition to all of the above. However, at the same time, the use of C1 and C0 control characters are restricted except for U+0085 (Next Line), U+000D (Carriage Return), U+000A (Line Feed) and U+0009 (Horizontal Tab) through requiring that they be written within escaped form. For C1 characters, the restriction is backwards incompatibility. The reason that it was introduced was to make it possible to detect common encoding errors.
The only character that isn’t permitted in any XML 1.1 or 1.0 document is the U+0000 (Null) code point.
It is possible to encode the Unicode character set into bytes for transmission or storage purposes in various ways. These are referred to as “encodings.” Encodings are defined by Unicode to cover the complete repertoire. Some of the most familiar ones include UTF-16 and UTF-8. Numerous other text encodings also exist that predate Unicode, like ISO/IEC 8859 and ASCII. In nearly every case their character repertoires are Unicode character set subsets.
Any Unicode-defined encodings are allowed to be used, along with any other encodings with characters appearing in Unicode. A mechanism is also provided by where an XML process, without having any prior knowledge, can reliably determine the encoding that is being used. Other than UTF-16 and UTF-8, not every parser will recognize all other encodings.
Escape facilities are provided by XML for including characters that can be problematic to be directly included. For instance:
There are some characters with glyphs that can’t be distinguished visually from other characters, like the space and the non-breaking space or the Latin capital letter “A” and the Cyrillic capital letter “A”.
Typing the character on the writer’s machine may not be possible.
There are five predefined entities:
“<” represents “<“; “>” represents “>”;
“&” represents “&”;
“‘” represents “‘”;
“”” represents ‘”‘.
Some characters encodings only support a subset of the overall Unicode. For instance, in ASCII encoding an XML document is legal. However, code points for some Unicode characters are lacking in ASCII.
The characters “&” and “<” are key syntax markers and cannot ever appear in the content that is outside of a CDATA section.
A numeric character reference can be used to represent any of the permitted Unicode characters. If a user has a keyboard that doesn’t offer a way of typing a certain character, it can still be inserted into an XML document by using certain encoding. So for example, if a user doesn’t have a certain Chinese character appearing on their keyboard, they will still be able to encode it to appear in an XML document.
However, null characters are not permitted since they are a control character that is excluded from it. This is true, even when a numeric character reference is used. An alternative encoding mechanism must be used for representing these characters, such as Base64.
Outside other markup, comments can appear anyplace within a document. There can be no comments prior to the XML declaration. Comments start with . The string “–” (double-hyphen) isn’t allowed …