Native Parser

Reference Materials

Switch parser to SWC and introduce native/WASM code

Compared to javascript, rust compiles to native code and offers significantly better performance. rollup has therefore switched from the javascript-based acorn parser to the rust-based swc parser, which parses complex ast efficiently. This is a core change in rollup v4.

`Challenges`

`Native Interaction`

Using swc's javascript package directly and parsing complex ast through the swc.parse javascript interface would incur significant communication overhead.

import swc from '@swc/core';

const code = `
  const a = 1;
  function add(a, b) {
    return a + b;
  }
`;
swc
  .parse(code, {
    syntax: 'ecmascript',
    comments: false,
    script: true,
    target: 'es3',
    isModule: false
  })
  .then(module => {
    module.type; // root node type, e.g. 'Script' or 'Module'
    module.body; // top-level statements of the ast
  });

swc's source code shows that swc internally uses the serde_json library to serialize the parsed program object into a JSON string, which is then passed to the javascript side.

rust

#[napi]
impl Task for ParseTask {
  type JsValue = String;
  type Output = String;

  fn compute(&mut self) -> napi::Result<Self::Output> {
    let options: ParseOptions = deserialize_json(&self.options)?;
    let fm = self
      .c
      .cm
      .new_source_file(self.filename.clone().into(), self.src.clone());

    let comments = if options.comments {
      Some(self.c.comments() as &dyn Comments)
    } else {
      None
    };

    let program = try_with(self.c.cm.clone(), false, ErrorFormat::Normal, |handler| {
      let mut p = self.c.parse_js(
        fm,
        handler,
        options.target,
        options.syntax,
        options.is_module,
        comments,
      )?;

      p.visit_mut_with(&mut resolver(
        Mark::new(),
        Mark::new(),
        options.syntax.typescript(),
      ));

      Ok(p)
    })
    .convert_err()?;

    let ast_json = serde_json::to_string(&program)?;

    Ok(ast_json)
  }

  fn resolve(&mut self, _env: Env, result: Self::Output) -> napi::Result<Self::JsValue> {
    Ok(result)
  }
}

The javascript interface side then deserializes the ast string returned by the native parser into a javascript object through JSON.parse.

class Compiler {
  async parse(
    src: string,
    options?: ParseOptions,
    filename?: string
  ): Promise<Program> {
    options = options || { syntax: 'ecmascript' };
    options.syntax = options.syntax || 'ecmascript';

    if (!bindings && !!fallbackBindings) {
      throw new Error(
        'Fallback bindings does not support this interface yet.'
      );
    } else if (!bindings) {
      throw new Error('Bindings not found.');
    }

    if (bindings) {
      const res = await bindings.parse(src, toBuffer(options), filename);
      return JSON.parse(res);
    } else if (fallbackBindings) {
      return fallbackBindings.parse(src, options);
    }
    throw new Error('Bindings not found.');
  }
}

Serializing the ast on the rust side and deserializing it again on the javascript side for every parse would erode almost all of the performance gained by switching to the native (rust) parser, especially for complex ast.

`Ast Compatibility`

Even with the estree compat module, swc still produces babel ast, not estree ast. However, rollup depends on standard estree ast.

`File Encoding`

swc uses utf-8 encoding, while rollup depends on standard javascript's utf-16 encoding.

utf-8 and utf-16 are two different character encoding methods used to represent characters in text. Their main differences lie in the number of bytes used per character and the encoding method.

Differences between utf-8 and utf-16

utf-8:

Variable Length Encoding:

utf-8 uses 1 ~ 4 bytes to represent a character. ascii characters (such as English letters and numbers) use 1 byte, while other characters (such as Chinese characters) may use 2 ~ 4 bytes.

1 byte: ascii characters (U+0000 to U+007F).
2 bytes: Extended Latin characters (U+0080 to U+07FF).
3 bytes: Basic Multilingual Plane (BMP) characters (U+0800 to U+FFFF).
4 bytes: Supplementary Plane characters (U+10000 to U+10FFFF).

Backward Compatible with ascii:

Since ascii characters only occupy 1 byte in utf-8, utf-8 is fully compatible with ascii encoding.

Encoding Efficiency:

High efficiency for English and ASCII text (1 byte per character).
For non-Latin characters (such as Chinese, Japanese, etc.), typically requires 3 bytes.
For supplementary plane characters (such as emojis), requires 4 bytes.

Use Cases:

More suitable for network transmission and storage, especially for text primarily in ascii.
Commonly used in web pages, json files, and other scenarios.

utf-16:

Fixed or Variable Length Encoding:

utf-16 typically uses 2 bytes to represent most commonly used characters, but for certain special characters (such as emojis), it may require 4 bytes.

2 bytes: Characters within the BMP range (U+0000 to U+FFFF, excluding the surrogate range U+D800 to U+DFFF).
4 bytes: Characters beyond the BMP (U+10000 to U+10FFFF), using two 16-bit units (called surrogate pairs).

Not Compatible with ascii:

UTF-16 is not compatible with ascii because ascii characters require 2 bytes in UTF-16. However, both utf-8 and utf-16 can treat each character of ascii as one unit.

Encoding Efficiency:

High efficiency for characters within the BMP range (such as most Chinese, Japanese) (2 bytes per character).
Low efficiency for ASCII characters (2 bytes per character).
Similar efficiency to UTF-8 for supplementary plane characters (requires 4 bytes).

Use Cases:

More suitable for memory operations, especially in scenarios primarily using BMP range characters (such as Chinese environments).
Commonly used in internal character representation for windows, javascript, and java.

Example Assumption:

For the string A你, the encoding results are as follows.

UTF-8 Encoding:

"A": 1 byte, encoded as 0x41

"你": 3 bytes, encoded as 0xE4BDA0

UTF-16 Encoding:

"A": 2 bytes, encoded as 0x0041

"你": 2 bytes, encoded as 0x4F60

Character positions in utf-8 are counted in bytes, while in utf-16 they are counted in 2-byte (16-bit) code units.
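
The byte counts listed above can be reproduced with a small TypeScript sketch. The helper names `encodeUtf8` and `encodeUtf16` are hypothetical and exist only for this illustration; they encode a single Unicode code point by hand so that the size difference between the two encodings becomes visible.

```typescript
// Encode one Unicode code point as utf-8 bytes (1 to 4 bytes).
function encodeUtf8(codePoint: number): number[] {
  if (codePoint < 0x80) return [codePoint];
  if (codePoint < 0x800) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f)
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f)
  ];
}

// Encode one Unicode code point as utf-16 code units (each unit is 2 bytes).
function encodeUtf16(codePoint: number): number[] {
  if (codePoint < 0x10000) return [codePoint];
  const offset = codePoint - 0x10000;
  // Supplementary plane: high surrogate followed by low surrogate.
  return [0xd800 | (offset >> 10), 0xdc00 | (offset & 0x3ff)];
}

const niCodePoint = 0x4f60; // "你"
const niUtf8 = encodeUtf8(niCodePoint); // [0xe4, 0xbd, 0xa0]: 3 bytes
const niUtf16 = encodeUtf16(niCodePoint); // [0x4f60]: one 2-byte unit
const emojiUtf16 = encodeUtf16(0x1f600); // [0xd83d, 0xde00]: surrogate pair, 4 bytes
```

For "你", a utf-8 position counter advances by 3 while a utf-16 position counter advances by 1, which is exactly the kind of drift that matters for ast node positions.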

Summary:

| Feature | utf-8 | utf-16 |
| --- | --- | --- |
| Encoding Length | 1-4 bytes | 2 or 4 bytes |
| ascii Compatibility | Compatible | Incompatible |
| ASCII Text Efficiency | High (1 byte/char) | Low (2 bytes/char) |
| Non-Latin Text Efficiency | Lower (3 bytes/char) | Higher (2 bytes/char) |
| Byte Order Issues | No concern | Needs BOM mark |
| Use Cases | Network protocols, file storage | Memory operations, large text processing |

When processing text, the choice between utf-8 and utf-16 affects file size and character position calculations. This impacts the determination of character positions in the ast. Consider the following example:

const a = "你好";

The ast produced under the babel ast specification and the ast produced under the estree ast specification differ in character positions.

babel ast (first block below) and estree ast (second block below):

json

{
  "type": "Module",
  "span": {
    "start": 0,
    "end": 19,
    "ctxt": 0
  },
  "body": [
    {
      "type": "VariableDeclaration",
      "span": {
        "start": 0,
        "end": 19,
        "ctxt": 0
      },
      "kind": "const",
      "declare": false,
      "declarations": [
        {
          "type": "VariableDeclarator",
          "span": {
            "start": 6,
            "end": 18,
            "ctxt": 0
          },
          "id": {
            "type": "Identifier",
            "span": {
              "start": 6,
              "end": 7,
              "ctxt": 0
            },
            "value": "a",
            "optional": false,
            "typeAnnotation": null
          },
          "init": {
            "type": "StringLiteral",
            "span": {
              "start": 10,
              "end": 18,
              "ctxt": 0
            },
            "value": "你好",
            "hasEscape": false,
            "kind": {
              "type": "normal",
              "containsQuote": true
            }
          },
          "definite": false
        }
      ]
    }
  ],
  "interpreter": null
}

json

{
  "type": "Program",
  "start": 0,
  "end": 15,
  "body": [
    {
      "type": "VariableDeclaration",
      "start": 0,
      "end": 15,
      "declarations": [
        {
          "type": "VariableDeclarator",
          "start": 6,
          "end": 14,
          "id": {
            "type": "Identifier",
            "start": 6,
            "end": 7,
            "name": "a"
          },
          "init": {
            "type": "Literal",
            "start": 10,
            "end": 14,
            "value": "你好",
            "raw": "\"你好\""
          }
        }
      ],
      "kind": "const"
    }
  ],
  "sourceType": "module"
}

It can be observed that the two specifications report different node positions for the same source, because they count positions in different encodings. In the babel ast, where positions are utf-8 byte offsets, the 你好 string literal spans [10, 18), while in the estree ast, where positions are utf-16 offsets, the same literal spans [10, 14).

The source map chapter details how rollup internally generates sourcemap, where rollup relies on the position information provided by estree ast for mapping markers.

export class NodeBase extends ExpressionEntity implements ExpressionNode {
  /**
   * Override to perform special initialisation steps after the scope is
   * initialised
   */
  initialise(): void {
    this.scope.context.magicString.addSourcemapLocation(this.start);
    this.scope.context.magicString.addSourcemapLocation(this.end);
  }
}

Therefore, positions computed in a different encoding would cause serious offsets in the sourcemap generated by rollup.
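
The position drift can be made concrete with a simplified TypeScript sketch that maps a utf-8 byte offset to the matching utf-16 index of a javascript string. This is not rollup's implementation; `utf8Length` and `utf8OffsetToUtf16Index` are hypothetical helpers written for this illustration only.

```typescript
// Number of utf-8 bytes needed for a code point below U+10000.
function utf8Length(codeUnit: number): number {
  if (codeUnit < 0x80) return 1;
  if (codeUnit < 0x800) return 2;
  return 3;
}

// Walk the string once, advancing a utf-8 byte counter and a utf-16 index
// together, and stop when the requested byte offset is reached.
function utf8OffsetToUtf16Index(text: string, utf8Offset: number): number {
  let bytes = 0;
  let index = 0;
  while (bytes < utf8Offset && index < text.length) {
    const unit = text.charCodeAt(index);
    const next = index + 1 < text.length ? text.charCodeAt(index + 1) : 0;
    const isSurrogatePair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isSurrogatePair) {
      // One supplementary code point: 4 utf-8 bytes, 2 utf-16 units.
      bytes += 4;
      index += 2;
    } else {
      bytes += utf8Length(unit);
      index += 1;
    }
  }
  return index;
}

const sampleCode = 'const a = "你好";';
const literalStart = utf8OffsetToUtf16Index(sampleCode, 10); // 10
const literalEnd = utf8OffsetToUtf16Index(sampleCode, 18); // 14
const programEnd = utf8OffsetToUtf16Index(sampleCode, 19); // 15
```

The three results match the babel ast and estree ast positions shown for the string literal and the program end in the example above.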

`Performance`

`Optimize Ast Compatibility`

On the rust side, rollup leverages swc's ability to parse code into babel ast:

rust

use swc_compiler_base::parse_js;

pub fn parse_ast(code: String, allow_return_outside_function: bool, jsx: bool) -> Vec<u8> {
  GLOBALS.set(&Globals::default(), || {
    let result = catch_unwind(AssertUnwindSafe(|| {
      let result = try_with_handler(&code_reference, |handler| {
        parse_js(
          cm,
          file,
          handler,
          target,
          syntax,
          IsModule::Unknown,
          Some(&comments),
        )
      });
      match result {
        Err(buffer) => buffer,
        Ok(program) => {
          let annotations = comments.take_annotations();
          let converter = AstConverter::new(&code_reference, &annotations);
          converter.convert_ast_to_buffer(&program)
        }
      }
    }));
  });
}

The converter.convert_ast_to_buffer(&program) method recursively traverses the babel ast produced by swc and recalculates, for each babel ast node position, the corresponding estree ast node position:

rust

/// Converts the given UTF-8 byte index to a UTF-16 byte index.
///
/// To be performant, this method assumes that the given index is not smaller
/// than the previous index. Additionally, it handles "annotations" like
/// `@__PURE__` comments in the process.
///
/// The logic for those comments is as follows:
/// - If the current index is at the start of an annotation, the annotation
///   is collected and the index is advanced to the end of the annotation.
/// - Otherwise, we check if the next character is a white-space character.
///   If not, we invalidate all collected annotations.
///   This is to ensure that we only collect annotations that directly precede
///   an expression and are not e.g. separated by a comma.
/// - If annotations are relevant for an expression, it can "take" the
///   collected annotations by calling `take_collected_annotations`. This
///   clears the internal buffer and returns the collected annotations.
/// - Invalidated annotations are attached to the Program node so that they
///   can all be removed from the source code later.
/// - If an annotation can influence a child that is separated by some
///   non-whitespace from the annotation, `keep_annotations_for_next` will
///   prevent annotations from being invalidated when the next position is
///   converted.
pub(crate) fn convert(&mut self, utf8_index: u32, keep_annotations_for_next: bool) -> u32 {
  if self.current_utf8_index > utf8_index {
    panic!(
      "Cannot convert positions backwards: {} < {}",
      utf8_index, self.current_utf8_index
    );
  }
  while self.current_utf8_index < utf8_index {
    if self.current_utf8_index == self.next_annotation_start {
      let start = self.current_utf16_index;
      let (next_comment_end, next_comment_kind) = self
        .next_annotation
        .map(|a| (a.comment.span.hi.0 - 1, a.kind.clone()))
        .unwrap();
      while self.current_utf8_index < next_comment_end {
        let character = self.character_iterator.next().unwrap();
        self.current_utf8_index += character.len_utf8() as u32;
        self.current_utf16_index += character.len_utf16() as u32;
      }
      if let Annotation(kind) = next_comment_kind {
        self.collected_annotations.push(ConvertedAnnotation {
          start,
          end: self.current_utf16_index,
          kind,
        });
      }
      self.next_annotation = self.annotation_iterator.next();
      self.next_annotation_start = get_annotation_start(self.next_annotation);
    } else {
      let character = self.character_iterator.next().unwrap();
      if !(self.keep_annotations || self.collected_annotations.is_empty()) {
        match character {
          ' ' | '\t' | '\r' | '\n' => {}
          _ => {
            self.invalidate_collected_annotations();
          }
        }
      }
      self.current_utf8_index += character.len_utf8() as u32;
      self.current_utf16_index += character.len_utf16() as u32;
    }
  }
  self.keep_annotations = keep_annotations_for_next;
  self.current_utf16_index
}

At the same time, rollup converts the babel ast parsed by swc into an estree-compatible binary format on the rust side and then passes it to javascript as an (array) buffer.

rust

pub(crate) fn convert_statement(&mut self, statement: &Stmt) {
  match statement {
    Stmt::Break(break_statement) => self.store_break_statement(break_statement),
    Stmt::Block(block_statement) => self.store_block_statement(block_statement, false),
    Stmt::Continue(continue_statement) => self.store_continue_statement(continue_statement),
    Stmt::Decl(declaration) => self.convert_declaration(declaration),
    Stmt::Debugger(debugger_statement) => self.store_debugger_statement(debugger_statement),
    Stmt::DoWhile(do_while_statement) => self.store_do_while_statement(do_while_statement),
    Stmt::Empty(empty_statement) => self.store_empty_statement(empty_statement),
    Stmt::Expr(expression_statement) => self.store_expression_statement(expression_statement),
    Stmt::For(for_statement) => self.store_for_statement(for_statement),
    Stmt::ForIn(for_in_statement) => self.store_for_in_statement(for_in_statement),
    Stmt::ForOf(for_of_statement) => self.store_for_of_statement(for_of_statement),
    Stmt::If(if_statement) => self.store_if_statement(if_statement),
    Stmt::Labeled(labeled_statement) => self.store_labeled_statement(labeled_statement),
    Stmt::Return(return_statement) => self.store_return_statement(return_statement),
    Stmt::Switch(switch_statement) => self.store_switch_statement(switch_statement),
    Stmt::Throw(throw_statement) => self.store_throw_statement(throw_statement),
    Stmt::Try(try_statement) => self.store_try_statement(try_statement),
    Stmt::While(while_statement) => self.store_while_statement(while_statement),
    Stmt::With(_) => unimplemented!("Cannot convert Stmt::With"),
  }
}

The converter extracts the information required for each estree ast node from the structure of the babel ast node and recalculates the position information under the estree ast specification, which uses utf-16 encoding.

rust

pub(crate) fn convert_item_list_with_state<T, S, F>(
  &mut self,
  item_list: &[T],
  state: &mut S,
  reference_position: usize,
  convert_item: F,
) where
  F: Fn(&mut AstConverter, &T, &mut S) -> bool,
{
  // for an empty list, we leave the referenced position at zero
  if item_list.is_empty() {
    return;
  }
  self.update_reference_position(reference_position);
  // store number of items in first position
  self
    .buffer
    .extend_from_slice(&(item_list.len() as u32).to_ne_bytes());
  let mut reference_position = self.buffer.len();
  // make room for the reference positions of the items
  self
    .buffer
    .resize(self.buffer.len() + item_list.len() * 4, 0);
  for item in item_list {
    let insert_position = (self.buffer.len() as u32) >> 2;
    if convert_item(self, item, state) {
      self.buffer[reference_position..reference_position + 4]
        .copy_from_slice(&insert_position.to_ne_bytes());
    }
    reference_position += 4;
  }
}
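
The list layout used by the function above (item count first, then one reserved reference slot per item, then the items themselves) can be sketched in TypeScript. The sketch is simplified on purpose: it works on an array of 32-bit words rather than on native-endian bytes, and `writeItemList` and `readItemPositions` are hypothetical names, not rollup functions.

```typescript
// Append a list to a growable array of 32-bit words and return its position.
function writeItemList(words: number[], items: number[][]): number {
  const listPosition = words.length;
  // Store the number of items in the first position.
  words.push(items.length);
  // Reserve one reference slot per item; each is filled in once the
  // item's final position is known.
  let referencePosition = words.length;
  for (let i = 0; i < items.length; i++) {
    words.push(0);
  }
  for (let i = 0; i < items.length; i++) {
    words[referencePosition] = words.length; // where this item starts
    for (let j = 0; j < items[i].length; j++) {
      words.push(items[i][j]);
    }
    referencePosition += 1;
  }
  return listPosition;
}

// Read back the start position of every item in the list.
function readItemPositions(words: number[], listPosition: number): number[] {
  const count = words[listPosition];
  const positions: number[] = [];
  for (let i = 0; i < count; i++) {
    positions.push(words[listPosition + 1 + i]);
  }
  return positions;
}

const sketchWords: number[] = [];
const sketchListAt = writeItemList(sketchWords, [[7, 8], [9]]);
// sketchWords is now [2, 3, 5, 7, 8, 9]
const sketchPositions = readItemPositions(sketchWords, sketchListAt); // [3, 5]
```

Reserving the reference slots up front lets items of different sizes be written sequentially while the reader can still jump to any item directly.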

The converter also collects comment nodes in preparation for rollup's later tree shaking. Note that comment nodes are part of the babel ast specification but not of the estree ast specification. However, the information carried by comment nodes is crucial for rollup, because it enhances the ability of tree shaking.

rollup collects this comment information and stores it in the estree ast through the _rollupAnnotations property. In other words, the ast that is finally returned is compatible with the estree ast structure and additionally contains the _rollupAnnotations property.

rust

pub(crate) fn take_collected_annotations(
  &mut self,
  kind: AnnotationKind,
) -> Vec<ConvertedAnnotation> {
  let mut relevant_annotations = Vec::new();
  for annotation in self.collected_annotations.drain(..) {
    if annotation.kind == kind {
      relevant_annotations.push(annotation);
    } else {
      self.invalid_annotations.push(annotation);
    }
  }
  relevant_annotations
}

impl<'a> AstConverter<'a> {
  pub(crate) fn store_call_expression(
    &mut self,
    span: &Span,
    is_optional: bool,
    callee: &StoredCallee,
    arguments: &[ExprOrSpread],
    is_chained: bool,
  ) {
    // annotations
    let annotations = self
      .index_converter
      .take_collected_annotations(AnnotationKind::Pure);
    // ... (rest of the method omitted in this excerpt)
  }
}

impl SequentialComments {
  pub(crate) fn add_comment(&self, comment: Comment) {
    if comment.text.starts_with('#') && comment.text.contains("sourceMappingURL=") {
      self.annotations.borrow_mut().push(AnnotationWithType {
        comment,
        kind: CommentKind::Annotation(AnnotationKind::SourceMappingUrl),
      });
      return;
    }
    let mut search_position = comment
      .text
      .chars()
      .nth(0)
      .map(|first_char| first_char.len_utf8())
      .unwrap_or(0);
    while let Some(Some(match_position)) = comment.text.get(search_position..).map(|s| s.find("__"))
    {
      search_position += match_position;
      // Using a byte reference avoids UTF8 character boundary checks
      match &comment.text.as_bytes()[search_position - 1] {
        b'@' | b'#' => {
          let annotation_slice = &comment.text[search_position..];
          if annotation_slice.starts_with("__PURE__") {
            self.annotations.borrow_mut().push(AnnotationWithType {
              comment,
              kind: CommentKind::Annotation(AnnotationKind::Pure),
            });
            return;
          }
          if annotation_slice.starts_with("__NO_SIDE_EFFECTS__") {
            self.annotations.borrow_mut().push(AnnotationWithType {
              comment,
              kind: CommentKind::Annotation(AnnotationKind::NoSideEffects),
            });
            return;
          }
        }
        _ => {}
      }
      search_position += 2;
    }
    self.annotations.borrow_mut().push(AnnotationWithType {
      comment,
      kind: CommentKind::Comment,
    });
  }

  pub(crate) fn take_annotations(self) -> Vec<AnnotationWithType> {
    self.annotations.take()
  }
}
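
The classification performed by add_comment above can be restated as a small TypeScript sketch. It is a simplified illustration under the assumption that the comment text is passed without its delimiters; `classifyComment` and the `CommentKind` strings are hypothetical names, not part of rollup.

```typescript
type CommentKind = 'sourceMappingUrl' | 'pure' | 'noSideEffects' | 'comment';

function classifyComment(commentText: string): CommentKind {
  // "# sourceMappingURL=..." comments are tracked as their own kind.
  if (
    commentText.charAt(0) === '#' &&
    commentText.indexOf('sourceMappingURL=') !== -1
  ) {
    return 'sourceMappingUrl';
  }
  // Look at every "__" that is directly preceded by "@" or "#".
  let position = commentText.indexOf('__', 1);
  while (position !== -1) {
    const marker = commentText.charAt(position - 1);
    if (marker === '@' || marker === '#') {
      const rest = commentText.slice(position);
      if (rest.indexOf('__PURE__') === 0) return 'pure';
      if (rest.indexOf('__NO_SIDE_EFFECTS__') === 0) return 'noSideEffects';
    }
    position = commentText.indexOf('__', position + 2);
  }
  return 'comment';
}

const pureKind = classifyComment('#__PURE__'); // 'pure'
const noSideEffectsKind = classifyComment(' @__NO_SIDE_EFFECTS__ '); // 'noSideEffects'
const plainKind = classifyComment(' plain __PURE__'); // 'comment'
```

Only annotations introduced by `@` or `#` count, so a comment that merely mentions `__PURE__` is treated as an ordinary comment.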

Finally, the resulting arraybuffer, which encodes an estree-compatible ast, is passed to the rollup (javascript) side. There, rollup walks the arraybuffer and instantiates its internal ast node classes from it.

export default class Module {
  async setSource({
    ast,
    code,
    customTransformCache,
    originalCode,
    originalSourcemap,
    resolvedIds,
    sourcemapChain,
    transformDependencies,
    transformFiles,
    ...moduleOptions
  }: TransformModuleJSON & {
    resolvedIds?: ResolvedIdMap;
    transformFiles?: EmittedFile[] | undefined;
  }): Promise<void> {
    // Measuring asynchronous code does not provide reasonable results
    timeEnd('generate ast', 3);
    const astBuffer = await parseAsync(
      code,
      false,
      this.options.jsx !== false
    );
    timeStart('generate ast', 3);
    this.ast = convertProgram(astBuffer, programParent, this.scope);
  }
}

How rollup decodes the ast at the buffer level:

function convertNode(
  parent: Node | { context: AstContext; type: string },
  parentScope: ChildScope,
  position: number,
  buffer: AstBuffer
): any {
  const nodeType = buffer[position];
  const NodeConstructor = nodeConstructors[nodeType];
  /* istanbul ignore if: This should never be executed but is a safeguard against faulty buffers */
  if (!NodeConstructor) {
    console.trace();
    throw new Error(`Unknown node type: ${nodeType}`);
  }
  const node = new NodeConstructor(parent, parentScope);
  node.type = nodeTypeStrings[nodeType];
  node.start = buffer[position + 1];
  node.end = buffer[position + 2];
  bufferParsers[nodeType](node, position + 3, buffer);
  node.initialise();
  return node;
}
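
The decoding step in convertNode can be reduced to a minimal TypeScript sketch that reads one node header from a buffer of 32-bit words. The layout assumed here (type id, start, end) follows the excerpt above, while the two-entry type table and the names `SketchNode`, `SKETCH_NODE_TYPES` and `decodeSketchNode` are hypothetical simplifications.

```typescript
interface SketchNode {
  type: string;
  start: number;
  end: number;
}

// Hypothetical type table; a real table has one entry per ast node type.
const SKETCH_NODE_TYPES = ['Program', 'Identifier'];

function decodeSketchNode(buffer: Uint32Array, position: number): SketchNode {
  const typeId = buffer[position];
  const type = SKETCH_NODE_TYPES[typeId];
  // Safeguard against faulty buffers.
  if (type === undefined) {
    throw new Error(`Unknown node type: ${typeId}`);
  }
  return {
    type: type,
    start: buffer[position + 1],
    end: buffer[position + 2]
  };
}

// One Identifier node spanning [6, 7), as in the earlier example.
const sketchBuffer = new Uint32Array([1, 6, 7]);
const sketchIdentifier = decodeSketchNode(sketchBuffer, 0);
// { type: 'Identifier', start: 6, end: 7 }
```

Because every field sits at a fixed offset from the node position, decoding needs only array reads and no string parsing.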

`Optimize Native Interaction`

As mentioned above, directly using the javascript interface exposed by swc serializes and deserializes the ast between rust and javascript on every parse. For complex ast, this overhead almost erodes the performance advantage of switching to the native (rust) parser. The solution is as follows:

Use arraybuffer to transfer the parsed ast between rust and javascript.

Do not use swc's javascript interface; instead, use swc's rust crates directly in rust.

rust

use swc_compiler_base::parse_js;

pub fn parse_ast(code: String, allow_return_outside_function: bool, jsx: bool) -> Vec<u8> {
  GLOBALS.set(&Globals::default(), || {
    let result = catch_unwind(AssertUnwindSafe(|| {
      let result = try_with_handler(&code_reference, |handler| {
        parse_js(
          cm,
          file,
          handler,
          target,
          syntax,
          IsModule::Unknown,
          Some(&comments),
        )
      });
      match result {
        Err(buffer) => buffer,
        Ok(program) => {
          let annotations = comments.take_annotations();
          let converter = AstConverter::new(&code_reference, &annotations);
          converter.convert_ast_to_buffer(&program)
        }
      }
    }));
  });
}

At the same time, rollup converts the babel ast parsed by swc into the estree-compatible binary format in rust and then passes it to javascript as an (array) buffer.

rust

match result {
  Err(buffer) => buffer,
  Ok(program) => {
    let annotations = comments.take_annotations();
    let converter = AstConverter::new(&code_reference, &annotations);
    converter.convert_ast_to_buffer(&program)
  }
}

Passing an arraybuffer is basically a lossless operation that needs no JSON serialization step, so we only need to teach the javascript side how to read the arraybuffer. In addition, the arraybuffer is only about one-third the size of the serialized json. Finally, this makes it easy to pass the ast in arraybuffer form between threads: for example, parsing can be completed in a WebWorker, and the resulting arraybuffer can then be passed to the main thread without loss.
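
Why a binary buffer is more compact than json can be illustrated with a toy TypeScript comparison. This is not a benchmark, and the ratio it produces says nothing about real modules; the node shape, the type id table `SKETCH_TYPE_IDS` and the helper `toBinaryWords` are hypothetical.

```typescript
interface FlatNode {
  type: string;
  start: number;
  end: number;
}

// Hypothetical numeric ids replacing the repeated type strings.
const SKETCH_TYPE_IDS: { [type: string]: number } = {
  Identifier: 0,
  Literal: 1
};

// Encode every node as three 32-bit words: type id, start, end.
function toBinaryWords(nodes: FlatNode[]): Uint32Array {
  const words = new Uint32Array(nodes.length * 3);
  for (let i = 0; i < nodes.length; i++) {
    words[i * 3] = SKETCH_TYPE_IDS[nodes[i].type];
    words[i * 3 + 1] = nodes[i].start;
    words[i * 3 + 2] = nodes[i].end;
  }
  return words;
}

const flatNodes: FlatNode[] = [
  { type: 'Identifier', start: 6, end: 7 },
  { type: 'Literal', start: 10, end: 14 }
];

// Length in characters; all characters are ascii here, so also bytes in utf-8.
const jsonSize = JSON.stringify(flatNodes).length;
// 2 nodes * 3 words * 4 bytes = 24 bytes.
const binarySize = toBinaryWords(flatNodes).byteLength;
```

The json form repeats every property name and type string for each node, while the binary form stores only fixed-width numbers.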

On the nodejs side, napi-rs is used to interact with the rust code; on the browser side, wasm-pack is used for building.

`Optimize Semantic Analysis`

`Parser` Semantic Analysis Design

Calling swc's swc_compiler_base::parse_js directly on the rust side does not perform semantic analysis; it only handles lexical analysis and syntax analysis. That is, the following code is parsed by swc without errors:

const a = 1;
const a = 2;

This is different from acorn, which reports some early errors, covering both syntax analysis and semantic analysis, while generating the ast.

The reason is that acorn is designed as a parser that conforms to the ECMAScript specification. Before a javascript engine executes code, the specification requires it to apply the Static Semantics: Early Errors rules (essentially static semantic analysis), which are checks that must be completed and reported during parsing and early syntax analysis. These errors are detected statically, which means the code does not need to be executed for them to be found.

The javascript engines built into browsers, nodejs and other runtimes also apply the Static Semantics: Early Errors rules before executing code.

The significance of the specification is:

Early Detection of Issues: It can find potential issues before the code is actually executed, avoiding issues that may appear at runtime.
Performance Improvement: Since these checks are completed in the static analysis stage, they can improve code execution efficiency.
Ensure Language Consistency: Through a unified early error check mechanism, ensure that javascript code can be processed consistently in different environments.
Help Developers Write Better Code: These rules also guide developers to follow better programming practices.

swc, babel and other parsers do not apply the Static Semantics: Early Errors rules when generating ast; that is, they are designed differently from acorn. Let's first look at why they separate syntax analysis from static semantic analysis.

Performance and Complexity Trade-off

Implementing early error detection requires the parser to do the following:
- Simulate and maintain the execution context of the statement currently being processed.
- Perform static rule checks:
  - Detection of other static semantic rules defined in the language specification.
  - Syntax restriction rule detection.
  - Module system static verification rule detection.

Although each check is not very complex, in large projects the complete set of early error checks would run every time the user transforms new code, and the cumulative cost may become a performance overhead that cannot be ignored.

Toolchain Division of Labor

swc, babel and other parsers focus on code transformation and are mostly integrated into the transformation step of a build system in the form of plugins. If a tool wants to integrate closely with the ecosystems of various build systems, the easiest way is to follow the single responsibility principle.
By separating parsing and semantic analysis:
- The parser can focus on generating an accurate ast.
- The semantic analyzer can focus on checking code correctness.
- Each part is easier to maintain and optimize.

Flexibility

In complex applications, transforming a module is usually not a one-step process: the code passes through intermediate states, and the intermediate code often does not comply with the semantic rules. If a transformation tool performed strict semantic analysis, such code could not be compiled, which would limit extensibility. Modern development toolchains distribute different checks across different stages and run them on demand, balancing development flexibility and code quality.

babel and swc choose to separate the responsibilities of syntax analysis and early error detection. When plugins transform code, the code is parsed into ast through lexical analysis and syntax analysis only, without early error checks (static semantic analysis). The bundler (such as rollup) then controls when the early error checks are executed, at a suitable time (such as rollup's transform stage).

This design choice reflects an important principle in engineering practice: sometimes, breaking down a complex problem into multiple independent steps may be more effective than trying to solve everything in one step. This allows each tool to focus on its core task, thereby providing better functionality and performance.
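
The separation described above can be sketched in TypeScript as a check that runs as its own pass over already parsed data. The sketch assumes the parser output is reduced to a flat list of declarations in a single scope; `SketchDeclaration` and `checkRedeclarations` are hypothetical names and the rule set is deliberately minimal.

```typescript
interface SketchDeclaration {
  name: string;
  kind: 'const' | 'let' | 'var';
  line: number;
}

// A standalone semantic pass: the parser never calls it, so the caller
// decides when (and whether) redeclarations are reported.
function checkRedeclarations(declarations: SketchDeclaration[]): string[] {
  const seen: { [name: string]: SketchDeclaration } = Object.create(null);
  const errors: string[] = [];
  for (let i = 0; i < declarations.length; i++) {
    const current = declarations[i];
    const previous = seen[current.name];
    if (previous === undefined) {
      seen[current.name] = current;
      continue;
    }
    // Repeated `var` declarations are legal; any other combination
    // within one scope is reported as an early error.
    if (!(previous.kind === 'var' && current.kind === 'var')) {
      errors.push(
        `Line ${current.line}: Identifier '${current.name}' has already been declared.`
      );
    }
  }
  return errors;
}

const sketchErrors = checkRedeclarations([
  { name: 'a', kind: 'const', line: 1 },
  { name: 'a', kind: 'const', line: 2 }
]);
// ["Line 2: Identifier 'a' has already been declared."]
```

Keeping the check outside the parser means intermediate code can still be parsed and transformed, and the check can run once the code has reached its final form.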

rollup plugin system design inspiration

This design approach is also reflected in the rollup plugin system: when a user plugin returns ast in the load (or transform) hook, rollup reuses that ast in the subsequent transform hooks. Until the transform stage is complete, rollup does not perform any semantic analysis on the reused ast.

const a = 1;
const a = 2;

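For the above example, acorn reports an error while parsing. Checks of this kind are built directly into acorn's parser; for example, the following excerpt from its class body parsing reports duplicate constructors and conflicting private names: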

while (this.type !== tt.braceR) {
  const element = this.parseClassElement(node.superClass !== null);
  if (element) {
    classBody.body.push(element);
    if (
      element.type === 'MethodDefinition' &&
      element.kind === 'constructor'
    ) {
      if (hadConstructor)
        this.raiseRecoverable(
          element.start,
          'Duplicate constructor in the same class'
        );
      hadConstructor = true;
    } else if (
      element.key &&
      element.key.type === 'PrivateIdentifier' &&
      isPrivateNameConflicted(privateNameMap, element)
    ) {
      this.raiseRecoverable(
        element.key.start,
        `Identifier '#${element.key.name}' has already been declared`
      );
    }
  }
}

Error message reported for the example above:

Line 2: Identifier 'a' has already been declared.

Therefore, rollup needs to leverage the capabilities of swc_ecma_lints to achieve more complete semantic analysis.

rust

use swc_ecma_lints::{rule::Rule, rules, rules::LintParams};

let result = HANDLER.set(&handler, || op(&handler));

match result {
  Ok(mut program) => {
    let unresolved_mark = Mark::new();
    let top_level_mark = Mark::new();
    let unresolved_ctxt = SyntaxContext::empty().apply_mark(unresolved_mark);
    let top_level_ctxt = SyntaxContext::empty().apply_mark(top_level_mark);

    program.visit_mut_with(&mut resolver(unresolved_mark, top_level_mark, false));

    let mut rules = rules::all(LintParams {
      program: &program,
      lint_config: &Default::default(),
      unresolved_ctxt,
      top_level_ctxt,
      es_version,
      source_map: cm.clone(),
    });

    HANDLER.set(&handler, || match &program {
      Program::Module(m) => {
        rules.lint_module(m);
      }
      Program::Script(s) => {
        rules.lint_script(s);
      }
    });

    if handler.has_errors() {
      let buffer = create_error_buffer(&wr, code);
      Err(buffer)
    } else {
      Ok(program)
    }
  }
}

Implement Semantic Analysis On JavaScript Side

However, from the following PR and discussion it can be known

Semantic Analysis Detection Point

The main tasks of semantic analysis include the following:

const_assign

Example:

logConstVariableReassignError | case/AssignmentExpression

export function logConstVariableReassignError() {
  return {
    code: CONST_REASSIGN,
    message: 'Cannot reassign a variable declared with `const`'
  };
}

// case
const x = 1;
x = 'string';

// implementation
export default class AssignmentExpression extends NodeBase {
  initialise(): void {
    super.initialise();
    if (this.left instanceof Identifier) {
      const variable = this.scope.variables.get(this.left.name);
      if (variable?.kind === 'const') {
        this.scope.context.error(
          logConstVariableReassignError(),
          this.left.start
        );
      }
    }
    this.left.setAssignedValue(this.right);
  }
}

duplicate_bindings

export function logRedeclarationError(name: string): RollupLog {
  return {
    code: REDECLARATION_ERROR,
    message: `Identifier "${name}" has already been declared`
  };
}

// case
import { x } from './b';
const x = 1;

// case2
import { x } from './b';
import { x } from './b';

// implementation
export default class Module {
  private addImport(node: ImportDeclaration): void {
    const source = node.source.value;
    this.addSource(source, node);

    for (const specifier of node.specifiers) {
      const localName = specifier.local.name;
      if (
        this.scope.variables.has(localName) ||
        this.importDescriptions.has(localName)
      ) {
        this.error(
          logRedeclarationError(localName),
          specifier.local.start
        );
      }

      const name =
        specifier instanceof ImportDefaultSpecifier
          ? 'default'
          : specifier instanceof ImportNamespaceSpecifier
            ? '*'
            : specifier.imported instanceof Identifier
              ? specifier.imported.name
              : specifier.imported.value;
      this.importDescriptions.set(localName, {
        module: null as never, // filled in later
        name,
        source,
        start: specifier.start
      });
    }
  }
}

// case
{
  const a = 1;
  const a = 1;
}

// implementation
export default class BlockScope extends ChildScope {
  addDeclaration(
    identifier: Identifier,
    context: AstContext,
    init: ExpressionEntity,
    destructuredInitPath: ObjectPath,
    kind: VariableKind
  ): LocalVariable {
    if (kind === 'var') {
      const name = identifier.name;
      const existingVariable =
        this.hoistedVariables?.get(name) ||
        (this.variables.get(name) as LocalVariable | undefined);
      if (existingVariable) {
        if (
          existingVariable.kind === 'var' ||
          (kind === 'var' && existingVariable.kind === 'parameter')
        ) {
          existingVariable.addDeclaration(identifier, init);
          return existingVariable;
        }
        return context.error(
          logRedeclarationError(name),
          identifier.start
        );
      }
      const declaredVariable = this.parent.addDeclaration(
        identifier,
        context,
        init,
        destructuredInitPath,
        kind
      );
      // Necessary to make sure the init is deoptimized for conditional declarations.
      // We cannot call deoptimizePath here.
      declaredVariable.markInitializersForDeoptimization();
      // We add the variable to this and all parent scopes to reliably detect conflicts
      this.addHoistedVariable(name, declaredVariable);
      return declaredVariable;
    }
    return super.addDeclaration(
      identifier,
      context,
      init,
      destructuredInitPath,
      kind
    );
  }
}

// case
try {
} catch (e) {
  const a = 1;
  const a = 2;
}

// implementation
export default class CatchBodyScope extends ChildScope {
  addDeclaration(
    identifier: Identifier,
    context: AstContext,
    init: ExpressionEntity,
    destructuredInitPath: ObjectPath,
    kind: VariableKind
  ): LocalVariable {
    if (kind === 'var') {
      const name = identifier.name;
      const existingVariable =
        this.hoistedVariables?.get(name) ||
        (this.variables.get(name) as LocalVariable | undefined);
      if (existingVariable) {
        const existingKind = existingVariable.kind;
        if (
          existingKind === 'parameter' &&
          // If this is a destructured parameter, it is forbidden to redeclare
          existingVariable.declarations[0].parent.type ===
            NodeType.CatchClause
        ) {
          // If this is a var with the same name as the catch scope parameter,
          // the assignment actually goes to the parameter and the var is
          // hoisted without assignment. Locally, it is shadowed by the
          // parameter
          const declaredVariable = this.parent.parent.addDeclaration(
            identifier,
            context,
            UNDEFINED_EXPRESSION,
            destructuredInitPath,
            kind
          );
          // To avoid the need to rewrite the declaration, we link the variable
          // names. If we ever implement a logic that splits initialization and
          // assignment for hoisted vars, the "renderLikeHoisted" logic can be
          // removed again.
          // We do not need to check whether there already is a linked
          // variable because then declaredVariable would be that linked
          // variable.
          existingVariable.renderLikeHoisted(declaredVariable);
          this.addHoistedVariable(name, declaredVariable);
          return declaredVariable;
        }
        if (existingKind === 'var') {
          existingVariable.addDeclaration(identifier, init);
          return existingVariable;
        }
        return context.error(
          logRedeclarationError(name),
          identifier.start
        );
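        }
      // Excerpt: the handling of a `var` that has no existing variable is omitted here.
    }
    // Non-`var` declarations (such as the `const` in the case above) fall
    // through to the generic scope logic.
    return super.addDeclaration(
      identifier,
      context,
      init,
      destructuredInitPath,
      kind
    );
  }
}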

// case
function fn() {
  const a = 1;
  const a = 2;
}

// implementation
export default class FunctionBodyScope extends ChildScope {
  // There is stuff that is only allowed in function scopes, i.e. functions can
  // be redeclared, functions and var can redeclare each other
  addDeclaration(
    identifier: Identifier,
    context: AstContext,
    init: ExpressionEntity,
    destructuredInitPath: ObjectPath,
    kind: VariableKind
  ): LocalVariable {
    const name = identifier.name;
    const existingVariable =
      this.hoistedVariables?.get(name) ||
      (this.variables.get(name) as LocalVariable);
    if (existingVariable) {
      const existingKind = existingVariable.kind;
      if (
        (kind === 'var' || kind === 'function') &&
        (existingKind === 'var' ||
          existingKind === 'function' ||
          existingKind === 'parameter')
      ) {
        existingVariable.addDeclaration(identifier, init);
        return existingVariable;
      }
      context.error(logRedeclarationError(name), identifier.start);
    }
    const newVariable = new LocalVariable(
      identifier.name,
      identifier,
      init,
      destructuredInitPath,
      context,
      kind
    );
    this.variables.set(name, newVariable);
    return newVariable;
  }
}

// case1
import { a } from './b';
const a = 1;

// case2
import { a } from './b';
import { a } from './b';

// implementation
export default class ModuleScope extends ChildScope {
  addDeclaration(
    identifier: Identifier,
    context: AstContext,
    init: ExpressionEntity,
    destructuredInitPath: ObjectPath,
    kind: VariableKind
  ): LocalVariable {
    if (this.context.module.importDescriptions.has(identifier.name)) {
      context.error(
        logRedeclarationError(identifier.name),
        identifier.start
      );
    }
    return super.addDeclaration(
      identifier,
      context,
      init,
      destructuredInitPath,
      kind
    );
  }
}

// case
const a = 1;
const a = 2;

// implementation
export default class Scope {
  /*
   * Redeclaration rules:
   * - var can redeclare var
   * - in function scopes, function and var can redeclare function and var
   * - var is hoisted across scopes, function remains in the scope it is declared
   * - var and function can redeclare function parameters, but parameters cannot redeclare parameters
   * - function cannot redeclare catch scope parameters
   * - var can redeclare catch scope parameters in a way
   *   - if the parameter is an identifier and not a pattern
   *   - then the variable is still declared in the hoisted outer scope, but the initializer is assigned to the parameter
   * - const, let, class, and function except in the cases above cannot redeclare anything
   */
  addDeclaration(
    identifier: Identifier,
    context: AstContext,
    init: ExpressionEntity,
    destructuredInitPath: ObjectPath,
    kind: VariableKind
  ): LocalVariable {
    const name = identifier.name;
    const existingVariable =
      this.hoistedVariables?.get(name) ||
      (this.variables.get(name) as LocalVariable);
    if (existingVariable) {
      if (kind === 'var' && existingVariable.kind === 'var') {
        existingVariable.addDeclaration(identifier, init);
        return existingVariable;
      }
      context.error(logRedeclarationError(name), identifier.start);
    }
    const newVariable = new LocalVariable(
      identifier.name,
      identifier,
      init,
      destructuredInitPath,
      context,
      kind
    );
    this.variables.set(name, newVariable);
    return newVariable;
  }
}
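
The redeclaration rules listed in the comment of the Scope class can be condensed into a single decision function. The following is a minimal standalone sketch under simplifying assumptions, not rollup's actual code: `canRedeclare`, the `Kind` union, the separate `catchParameter` kind, and the two boolean flags are hypothetical names introduced only for illustration, and var hoisting across scopes is not modelled.

```typescript
// Hypothetical, simplified model of the redeclaration rules; not rollup's API.
type Kind =
  | 'var'
  | 'let'
  | 'const'
  | 'class'
  | 'function'
  | 'parameter'
  | 'catchParameter';

function canRedeclare(
  newKind: Kind,
  existingKind: Kind,
  inFunctionScope: boolean,
  catchParameterIsPattern = false
): boolean {
  if (newKind === 'var') {
    // var can redeclare var and function parameters.
    if (existingKind === 'var' || existingKind === 'parameter') return true;
    // var and function can only redeclare each other in function scopes.
    if (existingKind === 'function') return inFunctionScope;
    // var can redeclare a catch parameter only if it is a plain identifier.
    if (existingKind === 'catchParameter') return !catchParameterIsPattern;
    return false;
  }
  if (newKind === 'function') {
    // In function scopes, function can redeclare function, var and parameters.
    return (
      inFunctionScope &&
      (existingKind === 'var' ||
        existingKind === 'function' ||
        existingKind === 'parameter')
    );
  }
  // const, let, class and parameters cannot redeclare anything.
  return false;
}
```

For instance, `canRedeclare('var', 'var', false)` is allowed, while `canRedeclare('const', 'const', false)` corresponds to the `const a = 1; const a = 2;` case and is rejected.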

duplicate_exports

export function logDuplicateExportError(name: string): RollupLog {
  return {
    code: DUPLICATE_EXPORT,
    message: `Duplicate export "${name}"`
  };
}

export default class Module {
  private assertUniqueExportName(name: string, nodeStart: number) {
    if (this.exports.has(name) || this.reexportDescriptions.has(name)) {
      this.error(logDuplicateExportError(name), nodeStart);
    }
  }
}

// case
export default 1;
export default 2;

// implementation
export default class Module {
  private addExport(
    node:
      | ExportAllDeclaration
      | ExportNamedDeclaration
      | ExportDefaultDeclaration
  ): void {
    if (node instanceof ExportDefaultDeclaration) {
      // export default foo;

      this.assertUniqueExportName('default', node.start);
      this.exports.set('default', {
        identifier: node.variable.getAssignedVariableName(),
        localName: 'default'
      });
    }
  }
}

// case
export * as a from './b';
export * as a from './b';

// implementation
export default class Module {
  private addExport(
    node: ExportAllDeclaration | ExportNamedDeclaration
  ): void {
    if (node instanceof ExportAllDeclaration) {
      const source = node.source.value;
      this.addSource(source, node);
      if (node.exported) {
        // export * as name from './other'

        const name =
          node.exported instanceof Literal
            ? node.exported.value
            : node.exported.name;
        this.assertUniqueExportName(name, node.exported.start);
        this.reexportDescriptions.set(name, {
          localName: '*',
          module: null as never, // filled in later,
          source,
          start: node.start
        });
      } else {
        // export * from './other'

        this.exportAllSources.add(source);
      }
    }
  }
}

// case
export { a } from './b';
export { a } from './b';

// implementation
export default class Module {
  private addExport(
    node: ExportAllDeclaration | ExportNamedDeclaration
  ): void {
    if (node.source instanceof Literal) {
      // export { name } from './other'

      const source = node.source.value;
      this.addSource(source, node);
      for (const { exported, local, start } of node.specifiers) {
        const name =
          exported instanceof Literal ? exported.value : exported.name;
        this.assertUniqueExportName(name, start);
        this.reexportDescriptions.set(name, {
          localName: local instanceof Literal ? local.value : local.name,
          module: null as never, // filled in later,
          source,
          start
        });
      }
    }
  }
}

// case1
export const a = 1;
export const a = 2;

// case2
export function a() {}
export function a() {}

// case3
export { a, a };

// implementation
export default class Module {
  private addExport(node: ExportNamedDeclaration): void {
    if (node.declaration) {
      const declaration = node.declaration;
      if (declaration instanceof VariableDeclaration) {
        // export var { foo, bar } = ...
        // export var foo = 1, bar = 2;

        for (const declarator of declaration.declarations) {
          for (const localName of extractAssignedNames(declarator.id)) {
            this.assertUniqueExportName(localName, declarator.id.start);
            this.exports.set(localName, { identifier: null, localName });
          }
        }
      } else {
        // export function foo () {}

        const localName = (declaration.id as Identifier).name;
        this.assertUniqueExportName(localName, declaration.id!.start);
        this.exports.set(localName, { identifier: null, localName });
      }
    }
  }
}

no_dupe_args

export function logDuplicateArgumentNameError(name: string): RollupLog {
  return {
    code: DUPLICATE_ARGUMENT_NAME,
    message: `Duplicate argument name "${name}"`
  };
}

// case
function fn(a, a) {}

// implementation
export default class ParameterScope extends ChildScope {
  /**
   * Adds a parameter to this scope. Parameters must be added in the correct
   * order, i.e. from left to right.
   */
  addParameterDeclaration(
    identifier: Identifier,
    argumentPath: ObjectPath
  ): ParameterVariable {
    const { name, start } = identifier;
    const existingParameter = this.variables.get(name);
    if (existingParameter) {
      return this.context.error(
        logDuplicateArgumentNameError(name),
        start
      );
    }
    const variable = new ParameterVariable(
      name,
      identifier,
      argumentPath,
      this.context
    );
    this.variables.set(name, variable);
    // We also add it to the body scope to detect name conflicts with local
    // variables. We still need the intermediate scope, though, as parameter
    // defaults are NOT taken from the body scope but from the parameters or
    // outside scope.
    this.bodyScope.addHoistedVariable(name, variable);
    return variable;
  }
}

The implementations above show that semantic analysis depends heavily on the execution context and scope information of the current ast node. These checks are only the most basic ones. rollup also performs other kinds of semantic analysis, such as side effect analysis, circular module dependency analysis, and strict syntax restrictions (for example, a namespace object cannot be called and imported bindings cannot be reassigned), none of which acorn can perform.
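
To illustrate why such checks need scope information, here is a minimal toy sketch. It assumes a hypothetical `ToyScope` class with a parent chain (these names are not rollup's classes): whether an assignment to `x` is an error depends entirely on which scope the assignment is evaluated in.

```typescript
// Hypothetical toy scope model; names are illustrative, not rollup's classes.
type VariableKind = 'const' | 'let' | 'var';

class ToyScope {
  private readonly variables = new Map<string, VariableKind>();
  private readonly parent: ToyScope | null;

  constructor(parent: ToyScope | null = null) {
    this.parent = parent;
  }

  declare(name: string, kind: VariableKind): void {
    this.variables.set(name, kind);
  }

  // Walk up the scope chain until a declaration is found.
  lookup(name: string): VariableKind | undefined {
    const own = this.variables.get(name);
    if (own !== undefined) return own;
    return this.parent ? this.parent.lookup(name) : undefined;
  }

  // Returns an error message if assigning to `name` is illegal in this scope.
  checkAssignment(name: string): string | null {
    return this.lookup(name) === 'const'
      ? 'Cannot reassign a variable declared with `const`'
      : null;
  }
}

const moduleScope = new ToyScope();
moduleScope.declare('x', 'const');

// A nested scope without its own `x` still sees the outer const binding.
const innerScope = new ToyScope(moduleScope);

// A nested scope that shadows `x` with `let` may assign to it freely.
const shadowingScope = new ToyScope(moduleScope);
shadowingScope.declare('x', 'let');
```

The same source text `x = 2` is rejected in `innerScope` but accepted in `shadowingScope`, which is why the check cannot be done without the scope of the node being visited.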

Because the internal implementation of swc_ecma_lints may have performance issues, performing semantic analysis on the javascript side is a temporary solution. rollup plans to add scope analysis on the rust side later and then hand the complete semantic analysis task over to the rust side.

`Optimize Ast Parsing`

rollup exposes this.parse on the plugin context so that user plugins can use the native swc parser to turn code into an ast. A plugin can return the parsed ast from its load or transform hook, and rollup will reuse that ast instead of parsing the code again.

If no user plugin provides an ast (i.e., no plugin returns an ast from its load or transform hook), rollup falls back to parsing the code itself: once the transform stage has completed, the transformed code is parsed into an estree-compatible ast using the native rust parser.
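
The following is a minimal sketch of a plugin that parses once and hands the ast back to rollup. It assumes the plugin context method `this.parse` and an `ast` field on the transform result. The local interfaces, the plugin name `inject-version`, and the `__VERSION__` placeholder are hypothetical and exist only to keep the example self-contained, and sourcemap handling is left out.

```typescript
// Minimal local types so the sketch is self-contained; a real plugin would
// use rollup's own type definitions instead.
interface AstNode {
  type: string;
  [key: string]: unknown;
}
interface MinimalPluginContext {
  parse(code: string): AstNode;
}
interface TransformResult {
  code: string;
  ast: AstNode;
}

// Hypothetical plugin: replaces a placeholder and returns the parsed ast.
function injectVersionPlugin(version: string) {
  return {
    name: 'inject-version',
    transform(this: MinimalPluginContext, code: string): TransformResult | null {
      // Returning null leaves the module untouched, so rollup parses it itself.
      if (!code.includes('__VERSION__')) return null;
      const transformed = code.split('__VERSION__').join(JSON.stringify(version));
      // Parse once in the plugin and return the ast so that it can be reused.
      return { code: transformed, ast: this.parse(transformed) };
    }
  };
}
```

An ast should only be returned when it matches the returned code exactly, since the returned ast is used in place of a fresh parse.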

Precautions for using this.parse

Currently, rollup has removed ast semantic analysis from the rust side. In other words, an ast produced by the this.parse api in the plugin context has not undergone semantic analysis.

If a user plugin needs an ast that is known to pass semantic analysis, the plugin must use other tools to perform semantic analysis on that ast.

If the plugin does not need that guarantee, rollup performs semantic analysis automatically later, while it recursively instantiates its internal ast node classes from the ast.

Even with native parsing capabilities, generating a complex ast is still time-consuming. In watch mode, rollup caches the estree ast (see the Rollup Incremental Build section for details) so that it can skip the native swc parsing step and instantiate its internal ast node classes directly by recursively walking the cached estree ast.
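
The caching idea can be sketched as follows. This is a simplified illustration, not rollup's incremental build logic: `AstCache`, `getAst`, and `parseCount` are hypothetical names, and the cache is assumed to be keyed by module id and invalidated when the code changes.

```typescript
// Hypothetical names throughout; this only illustrates the caching idea.
interface EstreeProgram {
  type: 'Program';
  body: unknown[];
}
interface CachedModule {
  code: string;
  ast: EstreeProgram;
}

class AstCache {
  private readonly modules = new Map<string, CachedModule>();
  private readonly parse: (code: string) => EstreeProgram;
  // Counts how often the (expensive) parser actually ran.
  parseCount = 0;

  constructor(parse: (code: string) => EstreeProgram) {
    this.parse = parse;
  }

  getAst(id: string, code: string): EstreeProgram {
    const cached = this.modules.get(id);
    // Reuse the cached ast only if the module's code is unchanged.
    if (cached && cached.code === code) return cached.ast;
    this.parseCount++;
    const ast = this.parse(code);
    this.modules.set(id, { code, ast });
    return ast;
  }
}
```

On a rebuild where a module's code is unchanged, `getAst` returns the stored ast without calling the parser, and only modules whose code changed are parsed again.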

`Performance Comparison`

The parsing performance of rollup was tested in versions 4.28.1 and 3.29.5, where:

Version 4.28.1 parses the ast with native swc, and the rust side passes an estree-compatible ast to the javascript side in ArrayBuffer format.

Version 3.29.5 parses the ast with acorn.

Each group was tested 5 times and the average was taken.

Code Length (Character)	SWC Parsing Time (ms)	Acorn Parsing Time (ms)
312.4K	13.47	73.92
624.7K	21.78	83.80
1.2M	36.03	124.82
2.5M	68.88	182.45
5.0M	136.52	272.53
10.0M	266.87	608.72
20.0M	578.00	1178.82
159.9M	4155.64	7276.24
319.9M	10081.40	-

Testing showed that once the input reached 319,869,952 characters, parsing the ast with acorn failed with an out-of-memory error:

bash

<--- Last few GCs --->

[69821:0x120078000]    15364 ms: Mark-sweep 4062.9 (4143.2) -> 4059.0 (4143.2) MB, 703.2 / 0.0 ms  (average mu = 0.293, current mu = 0.102) allocation failure; scavenge might not succeed
[69821:0x120078000]    16770 ms: Mark-sweep 4075.3 (4143.2) -> 4071.5 (4169.0) MB, 1383.5 / 0.0 ms  (average mu = 0.143, current mu = 0.016) allocation failure; scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

From the test results, it can be seen that switching to the native parser has a significant performance advantage over acorn.

Overall performance:
- Parsing with the native parser (built-in swc) takes less time on average, and the time grows gently as code length increases.
- Parsing with the non-native parser (built-in acorn) takes significantly longer on large code, showing high performance overhead.

Data comparison:
- Small code amount (312,373 characters): the gap is the most pronounced, about 5.5 times (13.47 ms vs 73.92 ms).
- Medium code amount (9,995,936 characters): the gap is about 2.28 times (266.87 ms vs 608.72 ms).
- Large code amount (159,934,976 characters): the gap is about 1.75 times (4155.64 ms vs 7276.24 ms).

Trend analysis:
- Parsing time with the native parser (built-in swc) grows relatively slowly, which suits the parsing of larger modules.
- Parsing time with the non-native parser (built-in acorn) grows relatively quickly, and its efficiency is clearly insufficient when parsing large modules.

Module character count for reference:

module	Code Length (Character)
rollup.js	312,373
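
The ratios quoted in the data comparison can be reproduced directly from the measurement table. The snippet below is a small worked example. The `measurements` array copies three rows from that table, and `speedup` is a helper name introduced here.

```typescript
// [code length label, swc parsing time in ms, acorn parsing time in ms],
// copied from the measurement table.
const measurements: Array<[string, number, number]> = [
  ['312.4K', 13.47, 73.92],
  ['10.0M', 266.87, 608.72],
  ['159.9M', 4155.64, 7276.24]
];

// How many times longer acorn takes than swc for the same input.
function speedup(swcMs: number, acornMs: number): number {
  return acornMs / swcMs;
}

const speedups = measurements.map(([label, swcMs, acornMs]) => ({
  label,
  ratio: speedup(swcMs, acornMs)
}));

for (const { label, ratio } of speedups) {
  console.log(`${label}: acorn takes ${ratio.toFixed(2)}x as long as swc`);
}
```

The ratio shrinks as the input grows, while the absolute time saved keeps increasing (about 60 ms at 312.4K characters versus about 3.1 s at 159.9M characters).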

Contributors

XiSenao

Changelog

Last edited about 1 month ago
