Exposing parse utils #3

Closed
pi0 opened this issue Feb 15, 2024 · 11 comments · Fixed by #4

Comments

@pi0
Collaborator

pi0 commented Feb 15, 2024

Hi. I quickly opened this tracker issue while writing unjs/automd#32 to ask whether you would be interested in also exposing a simple parse util (it could either stream or return the whole AST). This could be used as the parser core in unjs/omark ❤️

@ije
Owner

ije commented Feb 15, 2024

the md4c parser is pretty simple, it just receives 5 hooks:

  • enter_block: returns the block type and details (e.g., heading level)
  • leave_block: block end
  • enter_span: returns the span type and details (e.g., link href)
  • leave_span: span end
  • text: inner text

basically we can create these hooks in the host. calling js functions from a wasm module is not ideal, but yes, i think i can do it. i just don't know whether these hooks are enough for omark's goal?
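
For readers following along, the five hooks listed above could be modelled on the host side roughly as follows. This is a minimal sketch, not md4w's actual API: the `ParseHooks` interface, the `Detail` type, and the `createTracer` helper are hypothetical names introduced only for illustration, and the hook calls at the bottom are written by hand to stand in for what a parser might emit for `# Jobs`.

```typescript
// Hypothetical host-side shape for the five md4c-style hooks.
// Names and detail fields are assumptions, not md4w's real interface.
type Detail = Record<string, string | number>;

interface ParseHooks {
  enterBlock(type: string, detail?: Detail): void;
  leaveBlock(type: string): void;
  enterSpan(type: string, detail?: Detail): void;
  leaveSpan(type: string): void;
  text(value: string): void;
}

// Builds a hook set that only records the order of events in `log`.
function createTracer(log: string[]): ParseHooks {
  const fmt = (detail?: Detail) => (detail ? " " + JSON.stringify(detail) : "");
  return {
    enterBlock(type, detail) { log.push(`enter_block ${type}${fmt(detail)}`); },
    leaveBlock(type) { log.push(`leave_block ${type}`); },
    enterSpan(type, detail) { log.push(`enter_span ${type}${fmt(detail)}`); },
    leaveSpan(type) { log.push(`leave_span ${type}`); },
    text(value) { log.push(`text ${JSON.stringify(value)}`); },
  };
}

// Hand-written stand-in for the events a parser might emit for "# Jobs".
const log: string[] = [];
const hooks = createTracer(log);
hooks.enterBlock("h1", { level: 1 });
hooks.text("Jobs");
hooks.leaveBlock("h1");
```

Every hook call here would cross the wasm/JS boundary once, which is the cost the comment above is concerned about.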

@pi0
Collaborator Author

pi0 commented Feb 15, 2024

I am thinking of the fastest method to resolve the traversed MD tree so omark can make a simplified interface on top of it.

We might try to benchmark two methods:

  • Calling a JS hook every time the parser emits an event
  • Constructing the tree in native code and calling into JS only once, after the full traversal

Please let me know if you would like me to try, or if you would rather compare them yourself 👍🏼
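
To make the second option concrete, tree construction from enter/leave/text events is usually done with a stack. The sketch below does it in TypeScript purely for illustration; in the approach discussed here the same logic would live in native code and only the finished tree would cross into JS. `MdNode` and `TreeBuilder` are hypothetical names, and nothing here measures performance.

```typescript
// Illustrative only: a stack-based tree builder driven by enter/leave/text events.
type Props = Record<string, string | number>;
type MdNode = { type: string; props?: Props; children: (MdNode | string)[] };

class TreeBuilder {
  private root: MdNode = { type: "root", children: [] };
  private stack: MdNode[] = [this.root];

  // The node currently being filled (innermost open block or span).
  private top(): MdNode {
    return this.stack[this.stack.length - 1] ?? this.root;
  }

  // Opens a block or span and makes it the current node.
  enter(type: string, props?: Props): void {
    const node: MdNode = props ? { type, props, children: [] } : { type, children: [] };
    this.top().children.push(node);
    this.stack.push(node);
  }

  // Closes the current node; the root is never popped.
  leave(): void {
    if (this.stack.length > 1) this.stack.pop();
  }

  // Appends inner text to the current node.
  text(value: string): void {
    this.top().children.push(value);
  }

  // Returns the top-level blocks collected so far.
  finish(): (MdNode | string)[] {
    return this.root.children;
  }
}

const builder = new TreeBuilder();
builder.enter("p");
builder.text("Stay ");
builder.enter("em");
builder.text("foolish");
builder.leave(); // closes em
builder.leave(); // closes p
const built = builder.finish();
```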

@ije
Owner

ije commented Feb 15, 2024

i prefer constructing the tree. how about converting md to a jsx-like tree?

# Jobs
Stay _foolish_, stay **hungry**!
[Apple](https://apple.com)
<a href="https://apple.com">Apple</a>
[
  {type: 'h1', children: ['Jobs']},
  {type: 'p', children: [
    'Stay ',
    {type: "em", children: ["foolish"]},
    ', stay ',
    {type: "strong", children: ["hungry"]},
    '!',
    {type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
    {type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
  ]}
]
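
As a rough illustration of how a consumer might work with a tree of this shape, the sketch below types the proposed nodes and walks them to recover plain text. The `MdNode` and `MdChild` types and the `toPlainText` helper are assumptions made for this example; the thread does not define them, and the raw `html` node is left out of the sample data for brevity.

```typescript
// Assumed typing for the proposed JSX-like nodes (not an official md4w type).
type MdNode = { type: string; props?: Record<string, string>; children: MdChild[] };
type MdChild = MdNode | string;

// Sample data mirroring the tree proposed above.
const tree: MdNode[] = [
  { type: "h1", children: ["Jobs"] },
  {
    type: "p",
    children: [
      "Stay ",
      { type: "em", children: ["foolish"] },
      ", stay ",
      { type: "strong", children: ["hungry"] },
      "!",
      { type: "a", props: { href: "https://apple.com" }, children: ["Apple"] },
    ],
  },
];

// Recursively concatenates the text content of a node, ignoring props.
function toPlainText(node: MdChild): string {
  if (typeof node === "string") return node;
  return node.children.map((child) => toPlainText(child)).join("");
}

// Text of every top-level block, in document order.
const blockTexts = tree.map((block) => toPlainText(block));
```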

@pi0
Collaborator Author

pi0 commented Feb 15, 2024

Honestly, for omark, I am considering a flattened array of streamable data (to make markdown ASTs as simple as possible), plus some alternative ways of representing nesting.

If you prefer a nested tree like other parsers, that is no problem; we can always convert 👍🏼

@ije
Owner

ije commented Feb 15, 2024

what would the flattened array look like?

@ije
Owner

ije commented Feb 15, 2024

how about splitting by blocks? this should work as streamable data

--- chunk 1
{type: 'h1', children: ['Jobs']}
--- chunk 2
{type: 'p', children: [
  'Stay ',
  {type: "em", children: ["foolish"]},
  ', stay ',
  {type: "strong", children: ["hungry"]},
  '!',
  {type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
  {type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
]}

or use arrays instead of objects:

--- chunk 1
['h1', ['Jobs']]
--- chunk 2
['p', [
  'Stay ',
  ["em", ["foolish"]],
  ', stay ',
  ["strong", ["hungry"]],
  '!',
  ['a', {href: 'https://apple.com'}, ['Apple']],
  ['html', {html: '<a href="https://apple.com">Apple</a>'}, []]
]]
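
The two encodings above carry the same information, so one can be derived from the other. A minimal sketch of the object-to-array direction follows, assuming the node shape from the earlier proposal; `MdNode`, `Compact`, and `toCompact` are hypothetical names, not part of md4w.

```typescript
// Assumed object form (from the earlier proposal) and its compact array form.
type MdNode = { type: string; props?: Record<string, string>; children: (MdNode | string)[] };
type Compact =
  | string
  | [type: string, children: Compact[]]
  | [type: string, props: Record<string, string>, children: Compact[]];

// Converts one node to the array form; props are included only when present.
function toCompact(node: MdNode | string): Compact {
  if (typeof node === "string") return node;
  const children = node.children.map((child) => toCompact(child));
  return node.props ? [node.type, node.props, children] : [node.type, children];
}

// Each top-level block converts independently, so it can be emitted as its own chunk.
const heading = toCompact({ type: "h1", children: ["Jobs"] });
const link = toCompact({ type: "a", props: { href: "https://apple.com" }, children: ["Apple"] });
```

Because the props slot is optional, a consumer of the array form has to check whether the second element is an array or an object, which is the trade-off against the more verbose object form.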

@pi0
Collaborator Author

pi0 commented Feb 15, 2024

Yes, exactly, I am thinking about splitting by logical blocks, but it is tricky to represent (still thinking about how). Mainly, I am considering a Proxy that lets each block be accessed either as a stringified value or traversed individually. (Why? Because many tool use cases only need the high-level representation of the markdown AST, not the details.) Something like this:

[
  "Jobs", // .{ type: 'h1', contents: <Proxy>[p:stay foolish..a:apple] }
  "Stay foolish, stay hungry!", // .{ type: 'p', contents: <Proxy>[.stay, em: ...] }
  "Apple" // .{ type: 'a', contents: <Proxy>[apple] }
]
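
One possible reading of this idea, sketched with a standard JavaScript `Proxy`: indexing the view returns an object that stringifies to the block's text while still exposing its type and children. This is only an interpretation of the comment above, not omark's design; `createView`, `BlockView`, and `textOf` are hypothetical names.

```typescript
// Assumed node shape from the earlier proposal.
type MdNode = { type: string; children: (MdNode | string)[] };

// Recursively collects the text content of a node.
function textOf(node: MdNode | string): string {
  return typeof node === "string" ? node : node.children.map((c) => textOf(c)).join("");
}

// What indexing the view yields: details plus a string conversion.
interface BlockView {
  type: string;
  contents: (MdNode | string)[];
  toString(): string;
}

// Wraps the block array so numeric indices return BlockView objects on demand.
function createView(blocks: MdNode[]): ArrayLike<BlockView> {
  const proxy = new Proxy(blocks, {
    get(target, key, receiver) {
      if (typeof key === "string" && /^\d+$/.test(key)) {
        const node = target[Number(key)];
        if (!node) return undefined;
        return { type: node.type, contents: node.children, toString: () => textOf(node) };
      }
      // Everything else (length, methods) falls through to the array.
      return Reflect.get(target, key, receiver);
    },
  });
  return proxy as unknown as ArrayLike<BlockView>;
}

const view = createView([
  { type: "h1", children: ["Jobs"] },
  { type: "p", children: ["Stay ", { type: "em", children: ["foolish"] }, ", stay hungry!"] },
]);
const firstText = String(view[0]);  // uses the block's toString()
const secondText = String(view[1]);
```

Tools that only need the high-level text can call `String(...)` on each entry, while tools that need detail can read `type` and `contents`.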

I would love to brainstorm on this possibility together once we get there! I think as a first step we need the parsed AST, and I have high hopes of relying on md4w, as promised before, since it is native and minimal! If you are good with the first proposal, #3 (comment), I think we can start from there.

@ije
Owner

ije commented Feb 15, 2024

sounds cool! i will try to implement a mdToJson function for a start.

@pi0 pi0 mentioned this issue Feb 15, 2024
14 tasks
@pi0
Collaborator Author

pi0 commented Feb 15, 2024

I just made a quick wrapper in omark that produces (almost) the same object as you proposed, so we can work in parallel.

The object is meant for internal purposes only, and I can happily adjust to whatever you finally provide, but I would also love to have your 👍🏼 on unjs/mdbox#15 if you have a few minutes to check, so we are safe to go.

@ije
Owner

ije commented Feb 16, 2024

thanks

@ije ije mentioned this issue Feb 16, 2024
@ije
Owner

ije commented Feb 16, 2024

@pi0 #4 the first test has passed (not finished; it can't handle nested blocks/spans yet)

@ije ije closed this as completed in #4 Feb 19, 2024