The module provides the convertMicrodataToObject() function that parses ONE object microdata
and returns a Javascript object representing the data. You can find examples in
<PROJECT_HOME>/test/microdata-to-object-test.js
Main export: Function 'convert'
The convert function is the main export of this module. All other exports are for low-level performance optimizations that an application probably should not deal with.
For example, this minimal Person object with just an email address and a name is converted to the object below:
import {convertMicrodataToObject} from 'one.core/lib/microdata-to-object';
console.log(
convertMicrodataToObject(
'<div itemscope itemtype="//refin.io/Person">' +
'<span itemprop="email">foo@bar.com</span>' +
'<span itemprop="name">Foo Bar</span>' +
'</div>'
)
);
Output:
{
$type$: "Person",
email: "foo@bar.com",
name: "Foo Bar"
}
When parsing microdata, the primary thing the parser looks at is actually not the microdata string but the rules in object-recipes. For example, the rules for the Person object above could look like the example below. Rules are described in RecipeRule.
Person: [
{
// <span itemprop="email">foo@bar.com</span>
itemprop: 'email',
isId: true
},
{
// <span itemprop="name">Foo Bar</span>
itemprop: 'name'
}
],
When parsing microdata the parser iterates over the (ordered) sequence of rules. This tells the parser exactly what character(s) it should expect, and if the input string cannot be matched to the expectation the conversion fails with an error. There also is no room for any additional white-space or line-breaks, since every single unexpected character would change the SHA-256 hash of the microdata.
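The rule-driven approach described above can be illustrated with a minimal sketch. It is not the module's implementation: the function name parseWithRules is hypothetical, only flat string properties are handled, and every rule is treated as mandatory.

```javascript
// Hypothetical, simplified rule-driven parser; not the one.core implementation.
// Only handles flat string properties, and every rule is treated as mandatory.
function parseWithRules(microdata, type, rules) {
    const header = `<div itemscope itemtype="//refin.io/${type}">`;

    if (!microdata.startsWith(header)) {
        throw new Error(`Expected header ${header} at position 0`);
    }

    let position = header.length;
    const obj = {$type$: type};

    for (const rule of rules) {
        // The rule dictates the exact characters that must come next
        const openTag = `<span itemprop="${rule.itemprop}">`;

        if (!microdata.startsWith(openTag, position)) {
            throw new Error(`Expected ${openTag} at position ${position}`);
        }

        const valueStart = position + openTag.length;
        const valueEnd = microdata.indexOf('</span>', valueStart);

        if (valueEnd === -1) {
            throw new Error(`Missing </span> for itemprop "${rule.itemprop}"`);
        }

        obj[rule.itemprop] = microdata.slice(valueStart, valueEnd);
        position = valueEnd + '</span>'.length;
    }

    // Nothing but the closing tag may remain: no white-space, no line-breaks
    if (microdata.slice(position) !== '</div>') {
        throw new Error(`Unexpected characters at position ${position}`);
    }

    return obj;
}

const personRules = [{itemprop: 'email', isId: true}, {itemprop: 'name'}];
const person = parseWithRules(
    '<div itemscope itemtype="//refin.io/Person">' +
        '<span itemprop="email">foo@bar.com</span>' +
        '<span itemprop="name">Foo Bar</span>' +
        '</div>',
    'Person',
    personRules
);
console.log(person);
```

Because the sketch only ever checks for the exact string the current rule predicts, a single extra space between two tags is enough to make it throw.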
Other exports
In addition to the main conversion function the module also exports the converter's component functions, each responsible for part of a microdata object, so that other modules can use them to parse sections of microdata without doing a full conversion of the entire object.
It is important to note that those functions were written first of all for this module, and that exposing them via exports was an afterthought - quite deliberately. You notice this in the list of parameters that they require, which are optimized for the beginning-to-end one-step parsing of an entire microdata ONE object without copying strings.
Each component function is always given the full microdata string plus a position (number). Each function expects to find what it is told to look for at the exact position that it was given! That means any outside function calling, for example, parseValueByTheExpectedType must first find the correct position on its own and only then call parseValueByTheExpectedType. This is not a disadvantage for performance: there is no way around searching through the microdata to find that location in any case, and you still bypass the full microdata-to-object converter, which would parse every item up to that point.
For example, this is how you could extract a single value in a long microdata object. The example uses a OneTest$KeyValueMap object that could be defined using these rules:
[
{ itemprop: 'name' },
{
itemprop: 'item',
list: 'orderedByApp',
rule: [
{
itemprop: 'key'
},
{
itemprop: 'value',
list: CoreTypes.ORDERED_BY.ONE
}
]
}
]
import * as MicrodataToObject from 'one.core/lib/microdata-to-object';
// The parser requires the recipes used to construct the ONE objects
import * as ObjectRecipes from './object-recipes';
// We need the rule that constructed the property we want - note that this is part of the
// nested "item" object in the OneTest$KeyValueMap recipe defined above
const ValueRule = {
itemprop: 'value',
list: CoreTypes.ORDERED_BY.ONE
};
const microdata =
'<div itemscope itemtype="//refin.io/OneTest$KeyValueMap">' +
'<span itemprop="name">MyKey => RandomText</span>' +
'<span itemprop="keyJsType">number</span>' +
'<span itemprop="valueJsType">string</span>' +
'<span itemprop="item">' +
'<span itemprop="key">someKey1</span>' +
'<span itemprop="value">Foo text for key #1</span>' +
'</span>' +
'<span itemprop="item">' +
'<span itemprop="key">someKey2</span>' +
'<span itemprop="value">Why are there so many bananas in the pool?</span>' +
'</span>' +
'<span itemprop="item">' +
'<span itemprop="key">someKey3</span>' +
'<span itemprop="value">This man\'s name: Einstein</span>' +
'</span>' +
'</div>';
const searchKey = 'someKey2';
const searchString = '<span itemprop="key">' + searchKey + '</span>';
// IMPORTANT: Position must be on the value following the key, because that is what we want!
// The parse function will fail if the string at the given position does not start with
// <span itemprop="value">
const position = microdata.indexOf(searchString) + searchString.length;
// TODO fix example code
const {value} = MicrodataToObject.parseMicrodataByTheExpectedType(rule.itemtype, rule, CONTEXT)
//microdata, position, ValueRule);
console.log(value);
Output:
Why are there so many bananas in the pool?
Members
(static, constant) CONVERSION_FUNCTIONS :object
The conversion functions allow us to have non-string types when working with data that is always saved as a (microdata) string. Using the type meta information stored in the rule-sets in object-recipes.js, conversions from microdata (string) to the actual (Javascript) type are performed as part of the microdata-to-object conversion. The object is exported so that other modules can use the functions. For example, the map-helper module uses them to convert dynamic types. Those are not set in the rule-sets but stored with the actual individual object (see map object definitions in the recipes).
Programmer note: The advantage of this vs. one big function with a switch/case statement is that the latter cannot be optimized by the JS runtime (compiled) because the types it returns keep changing all the time. The type signatures of these short functions on the other hand are stable.
The conversion functions each take one string argument of escaped microdata (i.e. with "<", ">" and "&" characters written as HTML entities), unescape it to a regular string with the entities replaced by the actual characters, convert that string to the respective type, and return the result.
Type:
- object
Properties:
Name | Type | Description |
---|---|---|
string | function | (string) => string |
boolean | function | (string) => boolean |
number | function | (string) => number |
regexp | function | (string) => RegExp |
object | function | (string) => Object |
Map | function | (string) => Map |
Set | function | (string) => Set |
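As an illustration of the lookup-table design, here is a minimal sketch covering four of the types. The names CONVERSION_SKETCH, unescapeBasic and convertValue are hypothetical, and the serializations (booleans as "true"/"false", objects as JSON) are assumptions, not taken from the module's source.

```javascript
// Hypothetical stand-in for the module's unescapeFromHtml(); "&amp;" goes last
function unescapeBasic(s) {
    return s.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');
}

// Assumed encodings: booleans as "true"/"false", numbers in decimal notation,
// objects as JSON. The real module may use different serializations.
const CONVERSION_SKETCH = {
    string: s => unescapeBasic(s),
    boolean: s => unescapeBasic(s) === 'true',
    number: s => Number(unescapeBasic(s)),
    object: s => JSON.parse(unescapeBasic(s))
};

// The type name from the rule selects the converter; each short function has a
// stable return type, unlike one big switch/case function
function convertValue(typeName, rawMicrodataValue) {
    if (!Object.prototype.hasOwnProperty.call(CONVERSION_SKETCH, typeName)) {
        throw new Error(`No conversion function for type "${typeName}"`);
    }

    return CONVERSION_SKETCH[typeName](rawMicrodataValue);
}

console.log(convertValue('number', '42'));
console.log(convertValue('boolean', 'true'));
console.log(convertValue('string', 'a &lt; b &amp;&amp; c'));
console.log(convertValue('object', '{"x":1}'));
```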
(static, constant) EXTRACT_FUNCTIONS :object
Type:
- object
Properties:
Name | Type | Description |
---|---|---|
primitive | function | (CONTEXT, rule, isNested) => unknown |
reference | function | (CONTEXT, rule, isNested) => unknown |
unorderedCollection | function | (CONTEXT, rule, isNested) => unknown[] |
orderedArray | function | (CONTEXT, rule, isNested) => unknown[] |
map | function | (CONTEXT, rule, isNested) => unknown |
obj | function | (CONTEXT, rule, isNested) => unknown |
Methods
(static) extractMicrodataWithTag(html, position) → {Object}
Parameters:
Name | Type | Description |
---|---|---|
html | string | |
position | number | |
TODO This function should not be necessary; use findClosingTag() instead
Returns:
Type: Object
(static) findClosingTag(html, position) → {number}
Parameters:
Name | Type | Description |
---|---|---|
html | string | The HTML string to search |
position | number | Start searching at this position |
Find the end tag for a given opening span tag, skipping over any nested tags. It finds the position just after the last character ">" of the closing tag.
Assumptions
- This function does not check the names of the tags, it only blindly counts "<" (in combination with a "/" for a closing tag) and ">" characters. This means that any opening tag should have a corresponding closing tag, which is the case for our microdata (plus, we only have span tags but that does not matter to this function).
- The only tag name assumption is that we are looking for a span tag, and it only matters because we assume a constant length of the closing tag that we found when returning the final position.
- This function relies on the starting position already being advanced past the initial "<" character of the opening tag
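The counting approach under these assumptions can be sketched as follows. The function findClosingTagSketch is a hypothetical re-implementation for illustration; it counts only "<" characters and assumes that values never contain an unescaped "<".

```javascript
// Hypothetical re-implementation for illustration only.
// `position` must already be past the "<" of the opening tag.
function findClosingTagSketch(html, position) {
    let depth = 1; // the opening tag whose "<" we already skipped

    for (let i = position; i < html.length; i++) {
        if (html[i] !== '<') {
            continue;
        }

        if (html[i + 1] === '/') {
            depth -= 1;

            if (depth === 0) {
                // Assumes the closing tag is exactly "</span>"
                return i + '</span>'.length;
            }
        } else {
            depth += 1;
        }
    }

    throw new Error('No closing tag found');
}

const nestedHtml =
    '<span itemprop="item">' +
    '<span itemprop="key">k</span>' +
    '<span itemprop="value">v</span>' +
    '</span>' +
    '<span itemprop="next">n</span>';

const end = findClosingTagSketch(nestedHtml, 1);
console.log(nestedHtml.slice(0, end));
```

The slice printed at the end covers the complete "item" span including both nested spans, and stops before the "next" span.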
Throws:
- Throws an error if no closing tag could be found
- Type: Error
Returns:
The position of the character just after the final ">" of the closing tag
Type: number
(static) breakMicrodataIntoArray(html, startingTag, endingTag, startingTagPosition, endingTagPosition) → {Array.<string>}
Parameters:
Name | Type | Description |
---|---|---|
html | string | |
startingTag | string | |
endingTag | string | |
startingTagPosition | number | |
endingTagPosition | number | |
The only concern here is that your microdata must have the same tag that describes the item. E.g. only ... or
Breaks the given array microdata into an array of items, split using the given parameters (startingTag, endingTag). E.g. for the following microdata:
<ol><li>1</li><li>2</li></ol>
The result will be:
[1,2]
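A minimal sketch of the splitting idea, using a hypothetical helper splitItemsSketch. It handles flat (non-nested) items only and omits the two position parameters of the real function, because their semantics are not described here.

```javascript
// Hypothetical helper, flat (non-nested) items only; the position parameters of
// the real function are omitted because their semantics are not described here.
function splitItemsSketch(html, startingTag, endingTag) {
    const items = [];
    let position = html.indexOf(startingTag);

    while (position !== -1) {
        const valueStart = position + startingTag.length;
        const valueEnd = html.indexOf(endingTag, valueStart);

        if (valueEnd === -1) {
            throw new Error(`Missing ${endingTag} after position ${valueStart}`);
        }

        // Items are returned as strings; type conversion is a separate step
        items.push(html.slice(valueStart, valueEnd));
        position = html.indexOf(startingTag, valueEnd + endingTag.length);
    }

    return items;
}

const listItems = splitItemsSketch('<ol><li>1</li><li>2</li></ol>', '<li>', '</li>');
console.log(listItems);
```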
Returns:
Type: Array.<string>
(static) unescapeFromHtml(html) → {string}
Parameters:
Name | Type | Description |
---|---|---|
html | string | An HTML-escaped string that needs to be unescaped |
Strings saved inside microdata had to have some special characters replaced. See https://stackoverflow.com/a/7279035/544779. This is the reverse of what we did when creating microdata strings in function escapeHtml() of module object-to-microdata.
Returns:
Returns the unescaped string
Type: string
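The relationship between escaping and unescaping can be sketched as a round trip. Both functions below are hypothetical counterparts of escapeHtml() and unescapeFromHtml() and assume that only "<", ">" and "&" are replaced; the real functions may handle more characters.

```javascript
// Hypothetical counterparts of escapeHtml() and unescapeFromHtml(), covering only
// the three characters "<", ">" and "&".
function escapeSketch(text) {
    // "&" first, otherwise the "&" of "&lt;" would be escaped a second time
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function unescapeSketch(html) {
    // "&amp;" last, otherwise "&amp;lt;" would wrongly become "<"
    return html.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');
}

const original = 'if (a < b && b > c) print("&lt;")';
const escaped = escapeSketch(original);
console.log(escaped);
console.log(unescapeSketch(escaped) === original);
```

The replacement order is the design point: the input deliberately contains a literal "&lt;", which only survives the round trip if "&amp;" is unescaped last.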
(inner) extractPrimitiveTypeFromMicrodata(CONTEXT, rule, isNested) → {unknown}
Parameters:
Name | Type | Default | Description |
---|---|---|---|
CONTEXT | ParseContext | | |
rule | RecipeRule.itemprop | | |
isNested | boolean | false | |
Extracts any primitive type from the given microdata. See PrimitiveValueTypes.
Returns:
Type: unknown
(inner) extractReferenceTypeFromMicrodata(CONTEXT, rule, referenceType, isNested) → {unknown}
Parameters:
Name | Type | Default | Description |
---|---|---|---|
CONTEXT | ParseContext | | |
rule | RecipeRule.itemprop | | |
referenceType | ReferenceValueTypes.type | | |
isNested | boolean | false | |
Extracts a SHA-256 hash from the given microdata.
Returns:
Type: unknown
(inner) extractOrderedListTypeFromMicrodata(CONTEXT, rule, itemType, isNested) → {Array.<unknown>|undefined}
Parameters:
Name | Type | Default | Description |
---|---|---|---|
CONTEXT | ParseContext | | |
rule | RecipeRule | | |
itemType | StringValue \| IntegerValue \| NumberValue \| BooleanValue \| StringifiableValue \| ReferenceToObjValue \| ReferenceToIdValue \| ReferenceToClobValue \| ReferenceToBlobValue \| MapValue \| BagValue \| ArrayValue \| SetValue \| ObjectValue | | |
isNested | boolean | false | |
Extracts an ordered list from the given microdata.
Returns:
Type: Array.<unknown> | undefined
(inner) extractMapTypeFromMicrodata(CONTEXT, rule, valueType, isNested) → {Map.<unknown, unknown>}
Parameters:
Name | Type | Default | Description |
---|---|---|---|
CONTEXT | ParseContext | | |
rule | RecipeRule.itemprop | | |
valueType | MapValue | | |
isNested | boolean | false | |
Extracts a map object from the given microdata.
Returns:
Type: Map.<unknown, unknown>
(inner) extractUnorderedListTypeFromMicrodata(CONTEXT, rule, itemType, isNested) → {Array.<unknown>|undefined}
Parameters:
Name | Type | Default | Description |
---|---|---|---|
CONTEXT | ParseContext | | |
rule | RecipeRule | | |
itemType | StringValue \| IntegerValue \| NumberValue \| BooleanValue \| StringifiableValue \| ReferenceToObjValue \| ReferenceToIdValue \| ReferenceToClobValue \| ReferenceToBlobValue \| MapValue \| BagValue \| ArrayValue \| SetValue \| ObjectValue | | |
isNested | boolean | false | |
Extracts an unordered list from the given microdata.
Returns:
Type: Array.<unknown> | undefined
(inner) extractObjectTypeFromMicrodata(CONTEXT, rule, valueType) → {unknown}
Parameters:
Name | Type | Description |
---|---|---|
CONTEXT | ParseContext | |
rule | RecipeRule | |
valueType | ObjectValue | |
Extracts an object from the given microdata.
Returns:
Type: unknown
(inner) parseMicrodataByTheExpectedType(valueType, rule, CONTEXT, isNested) → {unknown}
Parameters:
Name | Type | Default | Description |
---|---|---|---|
valueType | ValueType | | |
rule | RecipeRule | | |
CONTEXT | ParseContext | | |
isNested | boolean | false | |
Returns:
Type: unknown
(static) parseData(type, rules, CONTEXT) → {OneObjectTypes}
Parameters:
Name | Type | Description |
---|---|---|
type | OneObjectTypeNames | The type was already found by the caller. This function expects it so that it can insert it into the returned object before any data properties. Because Javascript iterates over non-numerical properties in insertion order, the type then gets the first spot. This is for human readers of raw data output, the code does not care. |
rules | Array.<RecipeRule> | An array of rules corresponding to all rules for a given ONE object type from ONE object recipes |
CONTEXT | ParseContext | |
Parses only the inner data part of a ONE microdata object, i.e. the outer frame opening span tag has already been parsed and the HTML starting at the given position only contains span tags with actual data. This function continues until all rules are exhausted and the last rule ended up finding no matching string (i.e. no matching data value in a span tag with the expected itemprop name).
Throws:
Returns:
Type: OneObjectTypes
(static) parseHeader(expectedTypeopt, CONTEXT) → {OneObjectTypeNames}
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
expectedType | Set.<(OneObjectTypeNames\|"*")> | <optional> | Expect certain type strings, or '*' if we should accept any ONE object type that we have a recipe for. For included sub-objects this is set to the Set object of the recipe rule, for top-level objects this is set by the caller. |
CONTEXT | ParseContext | | |
This function only parses the opening enclosing "itemscope" span tag, extracts the type string and optionally compares it to a given expected type, and advances the position counter to the next character after the tag it was responsible for parsing.
Example
<div itemscope itemtype="//refin.io/Person">
leads to a return value of
{value: 'Person', position: 51}
Note that while convert takes a single string, this function expects a Set object to make it more flexible to accommodate parseObject, which also uses a Set that it in turn receives from the recipe used to parse a *sub-*object.
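The header step can be sketched as follows. This is a hypothetical illustration: the function name parseHeaderSketch and the shape of the context object ({html, position}) are assumptions and are not taken from the ParseContext type definition. The check against known recipes is left out.

```javascript
// Hypothetical sketch; the shape of the context object ({html, position}) is an
// assumption and not taken from the ParseContext type definition.
const HEADER_START = '<div itemscope itemtype="//refin.io/';
const HEADER_END = '">';

function parseHeaderSketch(expectedTypes, context) {
    const {html, position} = context;

    if (!html.startsWith(HEADER_START, position)) {
        throw new Error(`No itemscope tag at position ${position}`);
    }

    const typeStart = position + HEADER_START.length;
    const typeEnd = html.indexOf(HEADER_END, typeStart);

    if (typeEnd === -1) {
        throw new Error('Opening itemscope tag is not terminated');
    }

    const type = html.slice(typeStart, typeEnd);

    // '*' accepts any type; a real parser would also check that a recipe exists
    if (expectedTypes !== undefined && !expectedTypes.has('*') && !expectedTypes.has(type)) {
        throw new Error(`Type ${type} is not one of the expected types`);
    }

    // Advance to the first character after the tag
    context.position = typeEnd + HEADER_END.length;
    return type;
}

const context = {
    html: '<div itemscope itemtype="//refin.io/Person"><span itemprop="email">a@b.c</span></div>',
    position: 0
};
console.log(parseHeaderSketch(new Set(['Person']), context));
console.log(context.html.slice(context.position));
```

After the call the context position points at the first data span, which is where a data-parsing step would continue.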
Throws:
- Throws errors if the HTML could not be parsed, and also if the parsed type string is not a known ONE object type name
- Type: Error
Returns:
Type: OneObjectTypeNames
(static) parseObject(expectedType, CONTEXT) → {undefined|OneObjectTypes}
Parameters:
Name | Type | Description |
---|---|---|
expectedType | Set.<(OneObjectTypeNames\|"*")> | Expect certain type strings, or '*' if we should accept any ONE object type that we have a recipe for. For included sub-objects this is set to the Set object of the recipe rule, for top-level objects this is set by the caller. |
CONTEXT | ParseContext | |
Parses a complete html object including the outer frame that contains the type. There are two parts: First the opening outer span tag is parsed for the type information, then the html inside the outer frame is parsed for the data. The type string is used to find the rule-set to use for parsing - the order of rules determines the order data properties are expected. When the function is done there should be exactly the closing span tag of the outer frame left unparsed (of the current object - if this is an inner/included object).
- parseObject() is the top-level function to parse microdata into objects, determining the type and creating the object that is going to be returned.
- parseData() is the next-level function, parsing all the actual data of an object. It yields the object on the "data" property of the returned object.
- parseMicrodataByTheExpectedType() is the third-level function, parsing individual values, which can also be arrays. It tries to apply a rule from ObjectRecipes.getRecipe(type).rule to the HTML until there either is no more HTML left or the remaining HTML does not fit the rule it is working on.
Note that while convert takes a single string, this function expects a Set object because it needs to be more flexible: It uses the recipe rule's type property, which is a Set object.
Throws:
Returns:
Type: undefined | OneObjectTypes
(static) convertMicrodataToObject(html, expectedTypeopt) → {OneObjectTypes}
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
html | string | | One object in HTML (MicroData) representation |
expectedType | OneObjectTypeNames \| Array.<OneObjectTypeNames> | <optional> | An optional expected type, or an array of expected type names, which when not matched by the microdata leads to a |
Convert the microdata representation of a ONE object to Javascript using the rules in object-recipes.js. An exception is thrown if there is a problem during the conversion.
Parsing has been optimized to go through the microdata string only once. That means we will proceed only forward and never look ahead, for example look for an end-tag and then go back to parse what is in between.
Another optimization is that the original HTML string is kept unaltered and no new strings containing parts of the original string are created. Instead, we keep track of the ever-advancing position that our parsing has reached. Each sub-function returns 1) its result and 2) the new position within the original string that has been reached successfully.
The only exceptions are - by necessity - the actual values gained from parsing the string. We have to use one of the high-level Javascript methods (here: String.prototype.slice) without knowing how it is implemented in the respective Javascript runtime (and version).
While V8 (and probably other JS engines too) have an internal optimized representation of sub-strings using pointers we don't want to rely on that. For some background see: http://mrale.ph/blog/2016/11/23/making-less-dart-faster.html
If there is any discrepancy between what we expect and what we find, the respective function throws an exception immediately. This means the exception-free code path does not need any checks: if the code continues to run, we know everything is fine.
Throws:
Returns:
Returns the Javascript object version of the parsed microdata
Type: OneObjectTypes
(static) convertIdMicrodataToObject(html, expectedTypeopt) → {OneObjectTypes}
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
html | string | | One object in HTML (MicroData) representation |
expectedType | OneObjectTypeNames \| Array.<OneObjectTypeNames> | <optional> | An optional expected type or an array of expected type names which when not matched by the microdata leads to a |
See convertMicrodataToObject. This function is the same except it expects ID object microdata, with an extra ID object attribute data-id-object="true" in the outer <span>.
This is an extra function because ID objects are not valid ONE objects unless all non-ID properties are optional. They also serve a different purpose: instead of being used to store data they are used to point to ONE objects that are versions under the ID object.
We keep ID objects only to be able to tell, given an ID hash, which (ID) properties created it.
Also, while microdata of ID objects is different from microdata of a ONE object that has the exact same type and properties (possible if all non-ID properties are optional), there is no Javascript object format for ID objects. That is because ID objects are created on the fly from regular ONE objects and are ephemeral, only used to create an ID hash.
That means the objects returned by this function look like ONE objects, but rarely are (only when all non-ID properties of the $type$ are optional).
Throws:
Returns:
Returns the Javascript object version of the parsed microdata
Type: OneObjectTypes