Module: microdata-to-object

The module provides the convertMicrodataToObject() function that parses ONE object microdata and returns a Javascript object representing the data. You can find examples in <PROJECT_HOME>/test/microdata-to-object-test.js

Main export: Function 'convert'

The convert function is the main export of this module. All other exports are for low-level performance optimizations that an application probably should not deal with.

For example, this minimal Person object with just an email address and a name is converted to the object below:

import {convertMicrodataToObject} from 'one.core/lib/microdata-to-object';

console.log(
  convertMicrodataToObject(
    '<div itemscope itemtype="//refin.io/Person">' +
      '<span itemprop="email">foo@bar.com</span>' +
      '<span itemprop="name">Foo Bar</span>' +
    '</div>'
  )
);

Output:

{
  $type$: "Person",
  email: "foo@bar.com",
  name: "Foo Bar"
}

When parsing microdata, the primary thing the parser looks at is actually not the microdata string but the rules in object-recipes. For example, the rules for the Person object above could look like the example below. Rules are described in RecipeRule.

Person: [
  {
    // <span itemprop="email">foo@bar.com</span>
    itemprop: 'email',
    isId: true
  },
  {
    // <span itemprop="name">Foo Bar</span>
    itemprop: 'name'
  }
],

When parsing microdata the parser iterates over the (ordered) sequence of rules. This tells the parser exactly what character(s) it should expect, and if the input string cannot be matched to the expectation the conversion fails with an error. There also is no room for any additional white-space or line-breaks, since every single unexpected character would change the SHA-256 hash of the microdata.
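
The rule-driven matching described above can be illustrated with a minimal, self-contained sketch. It is not the module's implementation: `expectAt`, `parseWithRules` and the rule list are hypothetical names, every rule is treated as a required plain string value, and the outer frame is assumed to be closed by `</div>`.

```javascript
// Hypothetical, simplified rule set: one required string value per itemprop.
const personRules = [{itemprop: 'email'}, {itemprop: 'name'}];

// Fails immediately if `expected` is not found at exactly `position`.
function expectAt(html, position, expected) {
  if (!html.startsWith(expected, position)) {
    throw new Error(`Expected "${expected}" at position ${position}`);
  }
  return position + expected.length;
}

// Walks the ordered rules and only ever moves forward through `html`.
function parseWithRules(html, type, rules) {
  const result = {$type$: type};
  let position = expectAt(html, 0, `<div itemscope itemtype="//refin.io/${type}">`);

  for (const rule of rules) {
    position = expectAt(html, position, `<span itemprop="${rule.itemprop}">`);
    const end = html.indexOf('</span>', position);
    if (end === -1) {
      throw new Error(`Unterminated value for "${rule.itemprop}"`);
    }
    result[rule.itemprop] = html.slice(position, end);
    position = end + '</span>'.length;
  }

  // Assumption: the outer frame is closed by </div> and nothing follows it.
  position = expectAt(html, position, '</div>');
  if (position !== html.length) {
    throw new Error(`Unexpected trailing characters at position ${position}`);
  }
  return result;
}

console.log(
  parseWithRules(
    '<div itemscope itemtype="//refin.io/Person">' +
      '<span itemprop="email">foo@bar.com</span>' +
      '<span itemprop="name">Foo Bar</span>' +
    '</div>',
    'Person',
    personRules
  )
);
```

In this sketch, a single space inserted between two tags makes `expectAt` throw, which mirrors the strictness described above.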

Other exports

In addition to the main conversion function, the module also exports the converter's component functions, each responsible for part of a microdata object, so that other modules can use them to parse sections of microdata without doing a full conversion of the entire object.

It is important to note that those functions were written first of all for this module, and that exposing them via exports was an afterthought - quite deliberately. You notice this in the list of parameters that they require, which are optimized for the beginning-to-end one-step parsing of an entire microdata ONE object without copying strings.

Each component function is always given the full microdata string plus a position (number). Each function expects to find what it is told to look for at the exact position that it was given! That means any outside function calling, for example, parseMicrodataByTheExpectedType must first find the correct position on its own and only then call parseMicrodataByTheExpectedType. This is not a disadvantage for performance: there is no way around searching through the microdata to find that location in any case, and you still bypass the full microdata-to-object converter, which would parse every item up to that point.

For example, this is how you could extract a single value in a long microdata object:

The example uses a OneTest$KeyValueMap object that could be defined using these rules:

[
   { itemprop: 'name' },
   {
       itemprop: 'item',
       list: 'orderedByApp',
       rule: [
         {
           itemprop: 'key'
         },
         {
           itemprop: 'value',
           list: CoreTypes.ORDERED_BY.ONE
         }
       ]
   }
]

import * as MicrodataToObject from 'one.core/lib/microdata-to-object';

// The parser requires the recipes used to construct the ONE objects
import * as ObjectRecipes from './object-recipes';

// We need the rule that constructed the property we want - note that this is part of the
// nested "item" object in the OneTest$KeyValueMap recipe defined above
const ValueRule = {
  itemprop: 'value',
  list: CoreTypes.ORDERED_BY.ONE
};

const microdata =
'<div itemscope itemtype="//refin.io/OneTest$KeyValueMap">' +
  '<span itemprop="name">MyKey => RandomText</span>' +
  '<span itemprop="keyJsType">number</span>' +
  '<span itemprop="valueJsType">string</span>' +
  '<span itemprop="item">' +
    '<span itemprop="key">someKey1</span>' +
    '<span itemprop="value">Foo text for key #1</span>' +
  '</span>' +
  '<span itemprop="item">' +
    '<span itemprop="key">someKey2</span>' +
    '<span itemprop="value">Why are there so many bananas in the pool?</span>' +
  '</span>' +
  '<span itemprop="item">' +
    '<span itemprop="key">someKey3</span>' +
    '<span itemprop="value">This man\'s name: Einstein</span>' +
  '</span>' +
'</div>';

const searchKey = 'someKey2';
const searchString = '<span itemprop="key">' + searchKey + '</span>';

// IMPORTANT: Position must be on the value following the key, because that is what we want!
// The parse function will fail if the string at the given position does not start with
// <span itemprop="value">
const position = microdata.indexOf(searchString) + searchString.length;

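// TODO fix example code
// CONTEXT is a ParseContext; constructing it from `microdata` and `position` is not shown here.
const {value} = MicrodataToObject.parseMicrodataByTheExpectedType(ValueRule.itemtype, ValueRule, CONTEXT);
//microdata, position, ValueRule);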

console.log(value);

Output:

Why are there so many bananas in the pool?

Members

(static, constant) CONVERSION_FUNCTIONS :object

The conversion functions allow us to have non-string types when working with data that is always saved as a (microdata) string. Using type meta information stored in the rule-sets in object-recipes.js, conversions from microdata (string) to the actual (Javascript) type are performed as part of the microdata-to-object conversion. The object is exported so that other modules can use it. For example, the map-helper module uses it to convert dynamic types. Those are not set in the rule-sets but stored with the actual individual object (see map object definitions in the recipes).

Programmer note: The advantage of this vs. one big function with a switch/case statement is that the latter cannot be optimized by the JS runtime (compiled) because the types it returns keep changing all the time. The type signatures of these short functions on the other hand are stable.

The conversion functions each take one string argument of escaped microdata (i.e. "<", ">" and "&" characters written as HTML entities), unescape the microdata string to a regular string with the escaped characters replaced by the actual characters, convert the string to the respective type, and return the result.

Type:
  • object
Properties:
Name     Type      Description
string   function  (string) => string
boolean  function  (string) => boolean
number   function  (string) => number
regexp   function  (string) => RegExp
object   function  (string) => Object
Map      function  (string) => Map
Set      function  (string) => Set
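
To illustrate the lookup-table design, here is a hedged sketch with a few short converters keyed by type name. The names `unescapeEntities`, `conversionSketch` and `convertValue` are hypothetical, and the storage formats (booleans as "true"/"false", objects as JSON) are assumptions made for this example rather than facts about the module.

```javascript
// Hypothetical stand-in for the module's unescape step.
// "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;" and not "<".
function unescapeEntities(escaped) {
  return escaped
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

// One short function per type: each has a single, stable return type.
const conversionSketch = {
  string: s => unescapeEntities(s),
  boolean: s => unescapeEntities(s) === 'true', // assumption: stored as "true"/"false"
  number: s => Number(unescapeEntities(s)),
  object: s => JSON.parse(unescapeEntities(s)) // assumption: stored as JSON
};

// Looks up the converter by type name instead of branching in a switch.
function convertValue(typeName, escaped) {
  if (!Object.prototype.hasOwnProperty.call(conversionSketch, typeName)) {
    throw new Error(`No conversion for type "${typeName}"`);
  }
  return conversionSketch[typeName](escaped);
}

console.log(convertValue('number', '42'));
console.log(convertValue('string', 'a &lt; b &amp; c'));
```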

(static, constant) EXTRACT_FUNCTIONS :object

Type:
  • object
Properties:
Name                 Type      Description
primitive            function  (CONTEXT, rule, isNested) => unknown
reference            function  (CONTEXT, rule, isNested) => unknown
unorderedCollection  function  (CONTEXT, rule, isNested) => unknown[]
orderedArray         function  (CONTEXT, rule, isNested) => unknown[]
map                  function  (CONTEXT, rule, isNested) => unknown
obj                  function  (CONTEXT, rule, isNested) => unknown

Methods

(static) extractMicrodataWithTag(html, position) → {Object}

Parameters:
Name Type Description
html string
position number

TODO: This function should not be necessary; use findClosingTag() instead.

Returns:

Type: Object

(static) findClosingTag(html, position) → {number}

Parameters:
Name Type Description
html string

The HTML string to search

position number

Start searching at this position

Find the end tag for a given opening span tag, skipping over any nested tags. It finds the position just after the last character ">" of the closing tag.

Assumptions

  • This function does not check the names of the tags, it only blindly counts "<" (in combination with a "/" for a closing tag) and ">" characters. This means that any opening tag should have a corresponding closing tag, which is the case for our microdata (plus, we only have span tags but that does not matter to this function).
  • The only tag name assumption is that we are looking for a span tag, and it only matters because we assume a constant length of the closing tag that we found when returning the final position.
  • This function relies on the starting position already being advanced past the initial "<" character of the opening tag.

Throws:

Throws an error if no closing tag could be found

Type: Error

Returns:

The position of the character just after the final ">" of the closing tag

Type: number
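
The counting approach described in the assumptions above can be sketched as follows. This is a hypothetical re-implementation for illustration (`findClosingTagSketch` is not the module's function). It assumes that raw "<" characters only occur in tags because values are escaped, and that the closing tag is always `</span>`.

```javascript
// Assumption: the closing tag we return the position after is always "</span>".
const CLOSING_TAG_LENGTH = '</span>'.length;

// `position` must already be past the "<" of the opening tag we want to close.
function findClosingTagSketch(html, position) {
  let depth = 1; // the opening tag whose "<" was already consumed
  for (let i = position; i < html.length; i++) {
    if (html[i] !== '<') {
      continue;
    }
    if (html[i + 1] === '/') {
      depth -= 1;
      if (depth === 0) {
        // Position just after the ">" of the matching closing tag
        return i + CLOSING_TAG_LENGTH;
      }
    } else {
      depth += 1;
    }
  }
  throw new Error('No closing tag found');
}

const nested =
  '<span itemprop="item">' +
    '<span itemprop="key">someKey1</span>' +
    '<span itemprop="value">Foo</span>' +
  '</span>' +
  '<span itemprop="item"></span>';

// Start at 1, i.e. just past the first "<".
const end = findClosingTagSketch(nested, 1);
console.log(nested.slice(0, end));
```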

(static) breakMicrodataIntoArray(html, startingTag, endingTag, startingTagPosition, endingTagPosition) → {Array.<string>}

Parameters:
Name Type Description
html string
startingTag string
endingTag string
startingTagPosition number
endingTagPosition number

Breaks the given array microdata into an array of strings, split at the given tags (startingTag, endingTag). The only concern here is that every item in your microdata must use the same tag, e.g. only ... or only <li>...</li>, etc.

For example, for the following microdata:

<ol><li>1</li><li>2</li></ol>

The result will be:

[1,2]

Returns:

Type: Array.<string>
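
A simplified sketch of this kind of splitting is shown below. `splitItems` is a hypothetical helper, not the module's function: it omits the two position parameters (their exact meaning is not documented here), assumes flat, non-nested items, and returns the items as strings.

```javascript
// Collects the text between each startingTag/endingTag pair, moving forward only.
function splitItems(html, startingTag, endingTag) {
  const items = [];
  let position = html.indexOf(startingTag);
  while (position !== -1) {
    const valueStart = position + startingTag.length;
    const valueEnd = html.indexOf(endingTag, valueStart);
    if (valueEnd === -1) {
      throw new Error(`Missing "${endingTag}" after position ${valueStart}`);
    }
    items.push(html.slice(valueStart, valueEnd));
    position = html.indexOf(startingTag, valueEnd + endingTag.length);
  }
  return items;
}

console.log(splitItems('<ol><li>1</li><li>2</li></ol>', '<li>', '</li>'));
```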

    (static) unescapeFromHtml(html) → {string}

    Parameters:
    Name Type Description
    html string

    A string that needs to be HTML-unescaped

    Strings saved inside microdata had to have some special characters replaced. See https://stackoverflow.com/a/7279035/544779. This is the reverse of what we did when creating microdata strings in function escapeHtml() of module object-to-microdata.

    Returns:

    Returns the unescaped string

    Type: string
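
The relationship between escaping and unescaping can be shown with a small round trip. Both functions below are hypothetical sketches that assume only "&", "<" and ">" are replaced; they are not copies of the module's code.

```javascript
// Escape "&" first, otherwise the "&" of "&lt;" and "&gt;" would be escaped again.
function escapeSketch(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Reverse order: unescape "&amp;" last, so "&amp;lt;" becomes "&lt;" and not "<".
function unescapeSketch(html) {
  return html
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

const original = 'a < b && c > d, literally "&lt;"';
const escaped = escapeSketch(original);
console.log(escaped);
console.log(unescapeSketch(escaped) === original);
```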

    (inner) extractPrimitiveTypeFromMicrodata(CONTEXT, rule, isNested) → {unknown}

    Parameters:
    Name Type Default Description
    CONTEXT ParseContext
    rule RecipeRule.itemprop
    isNested boolean false

    Extracts any primitive type from the given microdata. See PrimitiveValueTypes.

    Returns:

    Type: unknown

    (inner) extractReferenceTypeFromMicrodata(CONTEXT, rule, referenceType, isNested) → {unknown}

    Parameters:
    Name Type Default Description
    CONTEXT ParseContext
    rule RecipeRule.itemprop
    referenceType ReferenceValueTypes.type
    isNested boolean false

    Extracts a SHA-256 hash from the given microdata.

    Returns:

    Type: unknown

    (inner) extractOrderedListTypeFromMicrodata(CONTEXT, rule, itemType, isNested) → {Array.<unknown>|undefined}

    Parameters:
    Name Type Default Description
    CONTEXT ParseContext
    rule RecipeRule
    itemType StringValue | IntegerValue | NumberValue | BooleanValue | StringifiableValue | ReferenceToObjValue | ReferenceToIdValue | ReferenceToClobValue | ReferenceToBlobValue | MapValue | BagValue | ArrayValue | SetValue | ObjectValue
    isNested boolean false

    Extracts an ordered list from the given microdata.

    Returns:

    Type: Array.<unknown> | undefined

    (inner) extractMapTypeFromMicrodata(CONTEXT, rule, valueType, isNested) → {Map.<unknown, unknown>}

    Parameters:
    Name Type Default Description
    CONTEXT ParseContext
    rule RecipeRule.itemprop
    valueType MapValue
    isNested boolean false

    Extracts a map object from the given microdata.

    Returns:

    Type: Map.<unknown, unknown>

    (inner) extractUnorderedListTypeFromMicrodata(CONTEXT, rule, itemType, isNested) → {Array.<unknown>|undefined}

    Parameters:
    Name Type Default Description
    CONTEXT ParseContext
    rule RecipeRule
    itemType StringValue | IntegerValue | NumberValue | BooleanValue | StringifiableValue | ReferenceToObjValue | ReferenceToIdValue | ReferenceToClobValue | ReferenceToBlobValue | MapValue | BagValue | ArrayValue | SetValue | ObjectValue
    isNested boolean false

    Extracts an unordered list from the given microdata.

    Returns:

    Type: Array.<unknown> | undefined

    (inner) extractObjectTypeFromMicrodata(CONTEXT, rule, valueType) → {unknown}

    Parameters:
    Name Type Description
    CONTEXT ParseContext
    rule RecipeRule
    valueType ObjectValue

    Extracts an object from the given microdata.

    Returns:

    Type: unknown

    (inner) parseMicrodataByTheExpectedType(valueType, rule, CONTEXT, isNested) → {unknown}

    Parameters:
    Name Type Default Description
    valueType ValueType
    rule RecipeRule
    CONTEXT ParseContext
    isNested boolean false

    Returns:

    Type: unknown

    (static) parseData(type, rules, CONTEXT) → {OneObjectTypes}

    Parameters:
    Name Type Description
    type OneObjectTypeNames

    The type was already found by the caller. This function expects it so that it can insert it into the returned object before any data properties. Because Javascript iterates over non-numerical properties in insertion order, the type then gets the first spot. This is for human readers of raw data output; the code does not care.

    rules Array.<RecipeRule>

    An array of rules corresponding to all rules for a given ONE object type from ONE object recipes

    CONTEXT ParseContext

    Parses only the inner data part of a ONE microdata object, i.e. the outer frame opening span tag has already been parsed and the HTML starting at the given position only contains span tags with actual data. This function continues until all rules are exhausted and the last rule ended up finding no matching string (i.e. no matching data value in a span tag with the expected itemprop name).

    Throws:
    Error
    Returns:

    Type: OneObjectTypes
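
The effect of inserting the type before any data properties can be seen in a small standalone example. `buildObject` is a hypothetical helper, not part of the module.

```javascript
// The type is set first, so it comes first when the object's keys are listed.
function buildObject(type, entries) {
  const obj = {$type$: type};
  for (const [key, value] of entries) {
    obj[key] = value;
  }
  return obj;
}

const person = buildObject('Person', [
  ['email', 'foo@bar.com'],
  ['name', 'Foo Bar']
]);

console.log(Object.keys(person)); // → [ '$type$', 'email', 'name' ]
```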

    (static) parseHeader(expectedType, CONTEXT) → {OneObjectTypeNames}

    Parameters:
    Name Type Attributes Description
    expectedType Set.<(OneObjectTypeNames|"*")> <optional>

    Expect certain type strings, or '*' if we should accept any ONE object type that we have a recipe for. For included sub-objects this is set to the Set object of the recipe rule, for top-level objects this is set by the caller.

    CONTEXT ParseContext

    This function only parses the opening enclosing "itemscope" span tag, extracts the type string and optionally compares it to a given expected type, and advances the position counter to the next character after the tag it was responsible for parsing.

    Example

    <div itemscope itemtype="//refin.io/Person">
    

    leads to a return value of

    {value: 'Person', position: 51}
    

    Note that while convert takes a single string, this function expects a Set object. That makes it flexible enough to accommodate parseObject, which also uses a Set that it in turn receives from the recipe used to parse a sub-object.

    Throws:

    Throws errors if the HTML could not be parsed, and also if the parsed type string is not a known ONE object type name

    Type: Error

    Returns:

    Type: OneObjectTypeNames
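
The header step can be sketched as below. This is an illustration under stated assumptions, not the module's code: `parseHeaderSketch` is a hypothetical name, the `{html, position}` object is only a stand-in for ParseContext (whose real shape is not documented here), and the check against known recipes is left out.

```javascript
// Prefix taken from the example tag above.
const HEADER_PREFIX = '<div itemscope itemtype="//refin.io/';

// `context` is a hypothetical {html, position} stand-in for ParseContext.
function parseHeaderSketch(expectedTypes, context) {
  if (!context.html.startsWith(HEADER_PREFIX, context.position)) {
    throw new Error(`No opening itemscope tag at position ${context.position}`);
  }
  const typeStart = context.position + HEADER_PREFIX.length;
  const typeEnd = context.html.indexOf('">', typeStart);
  if (typeEnd === -1) {
    throw new Error('Opening itemscope tag is not terminated');
  }
  const type = context.html.slice(typeStart, typeEnd);
  if (expectedTypes !== undefined && !expectedTypes.has('*') && !expectedTypes.has(type)) {
    throw new Error(`Unexpected type "${type}"`);
  }
  // Advance past the closing `">` of the tag this function was responsible for.
  context.position = typeEnd + 2;
  return type;
}

const context = {
  html:
    '<div itemscope itemtype="//refin.io/Person">' +
      '<span itemprop="email">foo@bar.com</span>' +
    '</div>',
  position: 0
};

console.log(parseHeaderSketch(new Set(['Person']), context), context.position);
```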

    (static) parseObject(expectedType, CONTEXT) → {undefined|OneObjectTypes}

    Parameters:
    Name Type Description
    expectedType Set.<(OneObjectTypeNames|"*")>

    Expect certain type strings, or '*' if we should accept any ONE object type that we have a recipe for. For included sub-objects this is set to the Set object of the recipe rule, for top-level objects this is set by the caller.

    CONTEXT ParseContext

    Parses a complete html object including the outer frame that contains the type. There are two parts: First the opening outer span tag is parsed for the type information, then the html inside the outer frame is parsed for the data. The type string is used to find the rule-set to use for parsing - the order of rules determines the order data properties are expected. When the function is done there should be exactly the closing span tag of the outer frame left unparsed (of the current object - if this is an inner/included object).

    • parseObject() is the top-level function to parse microdata into objects, determining the type and creating the object that is going to be returned.
    • parseData() is the next-level function, parsing all the actual data of an object. It yields the object on the "data" property of the returned object.
    • parseMicrodataByTheExpectedType() is the third-level function, parsing individual values, which can also be arrays. It tries to apply a rule from ObjectRecipes.getRecipe(type).rule to the HTML until there either is no more HTML left or the remaining HTML does not fit the rule it is working on.

    Note that while convert takes a single string this function expects a Set object because it needs to be more flexible: It uses the recipe rule's type property which is a Set object.

    Throws:
    Error
    Returns:

    Type: undefined | OneObjectTypes

    (static) convertMicrodataToObject(html, expectedType) → {OneObjectTypes}

    Parameters:
    Name Type Attributes Description
    html string

    One object in HTML (MicroData) representation

    expectedType OneObjectTypeNames | Array.<OneObjectTypeNames> <optional>

    An optional expected type, or an array of expected type names, which when not matched by the microdata leads to an Error when attempting to parse the microdata. Leaving this parameter undefined or setting it to '*' disables the type check.

    Convert the microdata representation of a ONE object to Javascript using the rules in object-recipes.js. An exception is thrown if there is a problem during the conversion.

    Parsing has been optimized to go through the microdata string only once. That means we only ever proceed forward and never look ahead, for example by searching for an end-tag and then going back to parse what is in between.

    Another optimization is that the original HTML string is kept unaltered and no new strings containing parts of the original string are created. Instead, we keep track of the ever-advancing position that our parsing has reached. Each sub-function returns 1) its result and 2) the new position within the original string that has been reached successfully.

    The only exceptions are - by necessity - the actual values gained from parsing the string. We have to use one of the high-level Javascript methods (here: String.prototype.slice) without knowing how it is implemented in the respective Javascript runtime (and version).

    While V8 (and probably other JS engines too) has an internal optimized representation of sub-strings using pointers, we don't want to rely on that. For some background see: http://mrale.ph/blog/2016/11/23/making-less-dart-faster.html

    If there is any discrepancy between what we expect and what we find, the respective function throws an exception immediately. This means the exception-free code path does not need any checks: if the code continues to run, we know everything is fine.

    Throws:
    Error
    Returns:

    Returns the Javascript object version of the parsed microdata

    Type: OneObjectTypes
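
Since the top-level function accepts a single type name, an array of names, '*' or undefined, while the lower-level functions expect a Set, some normalization step is needed in between. The sketch below shows one way this could look; `toExpectedTypeSet` and `isAcceptedType` are hypothetical helpers and not necessarily how the module does it.

```javascript
// Turns undefined, '*', a single name or an array of names into a Set.
function toExpectedTypeSet(expectedType) {
  if (expectedType === undefined) {
    return new Set(['*']);
  }
  return new Set(Array.isArray(expectedType) ? expectedType : [expectedType]);
}

// '*' in the Set disables the type check.
function isAcceptedType(type, expectedTypes) {
  return expectedTypes.has('*') || expectedTypes.has(type);
}

console.log(isAcceptedType('Person', toExpectedTypeSet(undefined)));
console.log(isAcceptedType('Person', toExpectedTypeSet(['Person', 'Group'])));
console.log(isAcceptedType('Person', toExpectedTypeSet('Group')));
```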

    (static) convertIdMicrodataToObject(html, expectedType) → {OneObjectTypes}

    Parameters:
    Name Type Attributes Description
    html string

    One object in HTML (MicroData) representation

    expectedType OneObjectTypeNames | Array.<OneObjectTypeNames> <optional>

    An optional expected type or an array of expected type names which when not matched by the microdata leads to an Error when attempting to parse the microdata. Leaving this parameter undefined or setting it to '*' disables the type check.

    See convertMicrodataToObject. This function is the same except it expects ID object microdata with an extra ID object attribute data-id-object="true" in the outer <span>. This is an extra function because ID objects are not valid ONE objects unless all non-ID properties are optional. They also serve a different purpose: instead of being used to store data, they are used to point to ONE objects that are versions under the ID object. We keep ID objects only to be able to tell, given an ID hash, which (ID) properties created it.

    Also, while microdata of ID objects is different from microdata of a ONE object that has the exact same type and properties (possible if all non-ID properties are optional), there is no Javascript object format for ID objects. That is because ID objects are created on the fly from regular ONE objects and are ephemeral, only used to create an ID hash.

    That means the objects returned by this function look like ONE objects, but rarely are (only when all non-ID properties of the $type$ are optional).

    Throws:
    Error
    Returns:

    Returns the Javascript object version of the parsed microdata

    Type: OneObjectTypes
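
To make the relationship between a ONE object and its ID object more concrete, here is a hedged sketch that keeps only the properties whose rule is marked with isId, as in the Person rules near the top of this page. `toIdObjectSketch` is a hypothetical helper; how the module actually derives ID objects is not shown in this documentation.

```javascript
// Rules as in the Person example: only "email" is an ID property.
const personRulesWithId = [
  {itemprop: 'email', isId: true},
  {itemprop: 'name'}
];

// Copies the type and the ID properties, dropping everything else.
function toIdObjectSketch(obj, rules) {
  const idObj = {$type$: obj.$type$};
  for (const rule of rules) {
    if (rule.isId === true && obj[rule.itemprop] !== undefined) {
      idObj[rule.itemprop] = obj[rule.itemprop];
    }
  }
  return idObj;
}

const fullPerson = {$type$: 'Person', email: 'foo@bar.com', name: 'Foo Bar'};
console.log(toIdObjectSketch(fullPerson, personRulesWithId));
```

The result looks like a regular Person object, which matches the remark above that objects returned for ID microdata look like ONE objects even though they usually are not.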