Parser

Nexus uses two methods of parsing data.  To parse is to break an otherwise continuous stream of near-random values into atomic, meaningful elements.  A web browser, for example, parses these sentences so that it can format the words to fit on the page, wrapping lines where needed.

A few definitions are useful before going further.

  • Atom/Atomic - A set of characters that forms one entire, separable word.
  • Phrase - A regularly bound set of potential atoms.
  • White space - A space character, a tab, and most times a carriage return.

    The first method of parsing gathers data into lines.  Carriage return and newline characters determine where each line ends.
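
    The line-gathering method can be illustrated with a minimal Python sketch.  It is not Nexus's own code: the class name LineGatherer and its feed method are hypothetical, and the sketch assumes that a carriage return immediately followed by a newline counts as a single line ending and that unterminated text is held until its line ending arrives.

```python
class LineGatherer:
    """Accumulates raw chunks and hands back complete lines (hypothetical helper)."""

    def __init__(self):
        self._partial = ""          # text received since the last line ending
        self._last_was_cr = False   # True when the previous character was '\r'

    def feed(self, chunk):
        """Consume a chunk of text and return the lines it completed."""
        lines = []
        for ch in chunk:
            if ch == "\n" and self._last_was_cr:
                # Second half of a "\r\n" pair: the line was already emitted.
                self._last_was_cr = False
                continue
            self._last_was_cr = (ch == "\r")
            if ch in "\r\n":
                lines.append(self._partial)
                self._partial = ""
            else:
                self._partial += ch
        return lines
```

    Feeding "hel" and then "lo\r" returns nothing for the first call and the single line "hello" for the second, which shows how a line split across two reads is still gathered whole.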

    The second method, which is much more complex, processes text into words and phrases.  A phrase is determined by matching pairs of certain characters...

  • " " - Double quotation marks
  • ( ) - Parentheses can be used to form a phrase, as is very common in many programming languages.
  • [ ] - Square brackets can also be used to denote a phrase.
  • { } - Curly brackets also delimit a quoted phrase.
  • < > - HTML-style tag markers also denote a phrase.
  • ' ' - Single quotation marks could potentially be used; however, because the (') character appears so often in abbreviations, it is treated as a letter.  Perhaps at some point single quotation marks will be used for phrase parsing.

    If any of these begins a phrase and any other pair appears inside it, the inner pair is gathered into the phrase... for example ""... will be a phrase, and can be further parsed into its sub-words...
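
    The gathering of a phrase, including any other pairs inside it, might be sketched as follows.  This is an illustration only, not the Nexus implementation: the names PAIRS and gather_phrase are hypothetical, and the sketch assumes that only nested pairs of the same kind as the opening character affect where the phrase ends, while all other pairs inside are carried along as ordinary content.

```python
# Opening character -> closing character, taken from the list above.
PAIRS = {'"': '"', "(": ")", "[": "]", "{": "}", "<": ">"}


def gather_phrase(text, start):
    """Return (phrase, next_index) for the phrase opening at text[start].

    Returns None when text[start] does not open a phrase or the phrase
    is never closed.  (Hypothetical helper, for illustration only.)
    """
    opener = text[start]
    closer = PAIRS.get(opener)
    if closer is None:
        return None
    depth = 1
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == closer:
            # Checked first so that '"', which opens and closes, ends the phrase.
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
        elif ch == opener:
            depth += 1  # assumption: same-kind nesting must balance
        i += 1
    return None
```

    The returned phrase has its outer markers removed, so it can be passed back through the word parser to be split into its sub-words.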

    Spaces and punctuation split lines into words and phrases.  Carriage return characters are treated as white space, and end previously accumulated words just like spaces and tabs.  Most punctuation marks ( % / , ; ! ? = + * & ~ # @ ) end words and become atoms in their own right.  A period is special in that it can be part of an ellipsis (...), where more than one of the same punctuation character may be in the same atom.  A period or colon behaves like a singular punctuation mark only when the next character is white space or a '%'.  This rule simplifies the collection of network addresses and filenames.  A '%' always introduces a variable reference.  A '-' followed by a digit is not an atom on its own; it is collected with the number, which allows negative numbers to be parsed.  If the '-' and the digit are separated by a space, the '-' is an atom by itself and may have no relation to the number.
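
    The word-splitting rules above can be sketched in Python as below.  This is a hedged illustration rather than the actual Nexus parser: split_atoms is a hypothetical name, phrase gathering and variable lookup are left out, and several details the text does not specify are assumed, namely that a newline counts as white space, that the end of the input acts like white space, that a run of two or more periods always forms its own atom, and that a '-' followed by a digit starts a new atom.

```python
WHITESPACE = " \t\r\n"
PUNCTUATION = "%/,;!?=+*&~#@"   # marks that always stand alone as atoms


def split_atoms(line):
    """Split one line into atoms following the rules described above."""
    atoms = []
    word = ""

    def flush():
        # End the word being accumulated, if there is one.
        nonlocal word
        if word:
            atoms.append(word)
            word = ""

    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch in WHITESPACE:
            flush()
        elif ch in PUNCTUATION:
            flush()
            atoms.append(ch)
        elif ch in ".:":
            # Collect a run of the same character (for example "...").
            j = i
            while j < n and line[j] == ch:
                j += 1
            run = line[i:j]
            after = line[j] if j < n else " "   # assumption: end acts as space
            if (ch == "." and len(run) > 1) or after in WHITESPACE or after == "%":
                flush()
                atoms.append(run)
            else:
                word += run   # part of an address or filename
            i = j
            continue
        elif ch == "-":
            flush()
            if i + 1 < n and line[i + 1].isdigit():
                word = ch     # joins the number that follows
            else:
                atoms.append(ch)
        else:
            word += ch
        i += 1
    flush()
    return atoms
```

    With these assumptions, "open 192.168.1.1:80, then stop." keeps the address as one atom and splits off only the comma and the final period, while "x = -5 - 3" keeps "-5" together and leaves the second '-' as its own atom.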

    Revision may be required to this page...