CocoIndex Built-in Functions

ParseJson

ParseJson parses a given text to JSON.

Input data:

text (Str): The source text to parse.
language (Optional[Str], default: "json"): The language of the source text. Only json is supported now.

Return: Json, the parsed JSON object.

DetectProgrammingLanguage

DetectProgrammingLanguage detects the programming language of a file based on its filename extension.

Input data:

filename (Str): The filename (with extension) to detect the language for.

Return: Str or Null. Returns the programming language name if the file extension is recognized, or Null if the extension is not supported.

The returned string values match the language names listed in tree-sitter-language-pack.
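As a rough illustration of the behavior described above (a recognized extension yields a language name, anything else yields Null), here is a minimal standalone sketch. The helper name detect_language and the four-entry table are hypothetical and exist only for this example; the real function recognizes many more extensions and is not implemented this way as far as this page states.

```python
import os
from typing import Optional

# Hypothetical, deliberately tiny extension table for illustration only.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".rs": "rust",
    ".go": "go",
    ".md": "markdown",
}


def detect_language(filename: str) -> Optional[str]:
    """Return a language name for a recognized extension, else None."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_LANGUAGE.get(extension)
```

In a flow, a Null result is commonly passed on as the language input of SplitRecursively, where it falls back to plain-text splitting.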

SplitRecursively

SplitRecursively splits a document into chunks of a given size. It tries to split at higher-level boundaries first. If a chunk is still too large, it tries the next level of boundaries. For example, for a Markdown file, it identifies boundaries in this order: level-1 sections, level-2 sections, level-3 sections, paragraphs, sentences, etc.

The spec takes the following fields:

custom_languages (list[CustomLanguageSpec], optional): Lets you customize how specific languages are chunked, using regular expressions. Each CustomLanguageSpec is a dict with the following fields:
- language_name (str): Name of the language.
- aliases (list[str], optional): A list of aliases for the language. It's an error if any language name or alias is duplicated.
- separators_regex (list[str]): A list of regex patterns to split the text. Higher-level boundaries should come first, and lower-level should be listed later. e.g. [r"\n# ", r"\n## ", r"\n\n", r"\. "]. See regex syntax for supported regular expression syntax.
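To make the "higher-level boundaries first" idea concrete, the following is a simplified standalone sketch, not the library's algorithm. It assumes sizes are measured in UTF-8 bytes, drops the matched separators, and ignores min_chunk_size and chunk_overlap entirely. The function name split_recursively is hypothetical.

```python
import re


def split_recursively(text, separators_regex, chunk_size):
    """Split at the highest-level boundary first; recurse into oversized pieces."""
    if len(text.encode("utf-8")) <= chunk_size:
        # Small enough: keep as one chunk (skip whitespace-only pieces).
        return [text] if text.strip() else []
    if not separators_regex:
        # No boundaries left: this toy version keeps the oversized text whole.
        return [text]
    head, *rest = separators_regex
    chunks = []
    for piece in re.split(head, text):
        # Pieces that are still too large are retried with lower-level separators.
        chunks.extend(split_recursively(piece, rest, chunk_size))
    return chunks
```

With the separator list from the example above, a section that fits within chunk_size stays whole, while a larger section is broken at paragraph and then sentence boundaries.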

Input data:

text (Str): The text to split.
chunk_size (Int64): The maximum size of each chunk, in bytes.
min_chunk_size (Int64, default: chunk_size / 2): The minimum size of each chunk, in bytes.

Note

SplitRecursively will do its best to keep output chunks sized between min_chunk_size and chunk_size. However, some chunks may be smaller than min_chunk_size or larger than chunk_size in rare cases, e.g. when the input text is too short or contains large non-splittable text.

Please avoid setting min_chunk_size to a value too close to chunk_size, to leave more room for the function to plan the optimal chunking.

chunk_overlap (Optional[Int64], default: None): The maximum overlap size between adjacent chunks, in bytes.
language (Optional[Str], default: None): The language of the document.

It can be a language name (e.g. python, javascript, markdown) or a file extension (e.g. .py, .js, .md).

When it's not provided or doesn't match any known language, the input will be treated as plain text.

Note

We use the language field to determine how to split the input text, following these rules:

We match the input language field against the following registries, in this order:
- custom_languages in the spec, against the language_name or aliases field of each entry. If language is not provided (None), it is matched against an entry with language_name == "" (empty string).
- Built-in languages (see the Supported Languages section below), against the language, aliases or file extensions of each entry.
All matching is case-insensitive.
If no match is found, the input will be treated as plain text.
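The lookup order above can be sketched as a small standalone function. This is only an illustration of the stated rules: both registries, the entry "MyLog", and the name resolve_language are hypothetical, and the return value is a label rather than anything the library produces.

```python
from typing import Optional

# Hypothetical registries for illustration; field names mirror the rules above.
CUSTOM_LANGUAGES = [
    {"language_name": "MyLog", "aliases": ["log"]},
    {"language_name": "", "aliases": []},  # matched when language is None
]
BUILTIN_LANGUAGES = [
    {"language": "python", "aliases": [], "extensions": [".py"]},
    {"language": "markdown", "aliases": ["md"], "extensions": [".md", ".mdx"]},
]


def resolve_language(language: Optional[str]) -> str:
    """Return a label for the registry entry that would handle `language`."""
    key = (language or "").lower()  # None is looked up as the empty name
    # 1. Custom languages: language_name or aliases, case-insensitive.
    for entry in CUSTOM_LANGUAGES:
        names = [entry["language_name"], *entry["aliases"]]
        if key in (name.lower() for name in names):
            return f"custom:{entry['language_name']}"
    # 2. Built-in languages: name, aliases or file extensions, case-insensitive.
    for entry in BUILTIN_LANGUAGES:
        names = [entry["language"], *entry["aliases"], *entry["extensions"]]
        if key in (name.lower() for name in names):
            return f"builtin:{entry['language']}"
    # 3. No match: treated as plain text.
    return "plain text"
```

Because custom entries are checked first, a custom language that reuses a built-in name or alias would take precedence in this sketch.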

Return: KTable, each row represents a chunk, with the following sub fields:

location (Range): The location of the chunk.
text (Str): The text of the chunk.
start / end (Struct): Details about the start position (inclusive) and end position (exclusive) of the chunk. They have the following sub fields:
- offset (Int64): The byte offset of the position.
- line (Int64): The line number of the position. Starting from 1.
- column (Int64): The column number of the position. Starting from 1.

Supported Languages

Currently, SplitRecursively supports the following languages:

Language	Aliases	File Extensions
c		`.c`
cpp	c++	`.cpp`, `.cc`, `.cxx`, `.h`, `.hpp`
csharp	csharp, cs	`.cs`
css		`.css`, `.scss`
dtd		`.dtd`
fortran	f, f90, f95, f03	`.f`, `.f90`, `.f95`, `.f03`
go	golang	`.go`
html		`.html`, `.htm`
java		`.java`
javascript	js	`.js`
json		`.json`
kotlin		`.kt`, `.kts`
markdown	md	`.md`, `.mdx`
pascal	pas, dpr, delphi	`.pas`, `.dpr`
php		`.php`
python		`.py`
r		`.r`
ruby		`.rb`
rust	rs	`.rs`
scala		`.scala`
solidity		`.sol`
sql		`.sql`
swift		`.swift`
toml		`.toml`
tsx		`.tsx`
typescript	ts	`.ts`
xml		`.xml`
yaml		`.yaml`, `.yml`

If you don't specify the language field, or the language you specified doesn't match any known language, the input is treated as plain text. In that case the text is handled as an article and split based on blank lines, punctuation marks, whitespace, etc.

SplitBySeparators

SplitBySeparators splits text by specified regex separators only, without recursive chunking. This is useful when you want direct control over how text is split, e.g. splitting by blank lines or specific delimiters.

The spec takes the following fields:

separators_regex (list[str]): A list of regex patterns to use as separators. See regex syntax for supported regular expression syntax.
keep_separator (Literal["NONE", "LEFT", "RIGHT"], default: "NONE"): Whether to attach the matched separator to the chunk on its left or right, or discard it.
include_empty (bool, default: False): Whether to include empty chunks in the output.
trim (bool, default: True): Whether to trim whitespace from each chunk.
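The interaction of these four fields can be illustrated with a small standalone sketch built on Python's re module. It is a toy model under stated assumptions, not the library's implementation: the function name split_by_separators is hypothetical, and applying trim before dropping empty chunks is an assumption this page does not confirm.

```python
import re


def split_by_separators(text, separators_regex, keep_separator="NONE",
                        include_empty=False, trim=True):
    """Toy model of the spec fields above."""
    pattern = re.compile("|".join(f"(?:{p})" for p in separators_regex))
    chunks, start = [], 0
    for match in pattern.finditer(text):
        if keep_separator == "LEFT":
            # Separator stays attached to the chunk on its left.
            chunks.append(text[start:match.end()])
            start = match.end()
        elif keep_separator == "RIGHT":
            # Separator becomes the beginning of the chunk on its right.
            chunks.append(text[start:match.start()])
            start = match.start()
        else:
            # "NONE": the separator is discarded.
            chunks.append(text[start:match.start()])
            start = match.end()
    chunks.append(text[start:])
    if trim:
        chunks = [chunk.strip() for chunk in chunks]
    if not include_empty:
        chunks = [chunk for chunk in chunks if chunk]
    return chunks
```

For example, splitting "a, b;c" on a comma-with-whitespace pattern and a semicolon pattern yields three chunks in every mode; only the placement of the separators differs.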

Input data:

text (Str): The text to split.

Return: KTable, each row represents a chunk, with the following sub fields:

location (Range): The location of the chunk.
text (Str): The text of the chunk.
start / end (Struct): Details about the start position (inclusive) and end position (exclusive) of the chunk. They have the following sub fields:
- offset (Int64): The byte offset of the position.
- line (Int64): The line number of the position. Starting from 1.
- column (Int64): The column number of the position. Starting from 1.

SentenceTransformerEmbed

SentenceTransformerEmbed embeds a text into a vector space using the SentenceTransformer library.

Optional Dependency Required

This function requires the 'sentence-transformers' library, which is an optional dependency. Install CocoIndex with:

pip install 'cocoindex[embeddings]'

The spec takes the following fields:

model (str): The name of the SentenceTransformer model to use.
args (dict[str, Any], optional): Additional arguments to pass to the SentenceTransformer constructor. e.g. {"trust_remote_code": True}

Input data:

text (Str): The text to embed.

Return: Vector[Float32, N], where N is determined by the model

ExtractByLlm

ExtractByLlm extracts structured information from a text using a specified LLM. The spec takes the following fields:

llm_spec (cocoindex.LlmSpec): The specification of the LLM to use. See LLM Spec for more details.
output_type (type): The type of the output. e.g. a dataclass type name. See Data Types for all supported data types. The LLM will output values that match the schema of the type.
instruction (str, optional): Additional instruction for the LLM.

Clear type definitions

The definition of the output_type is fed into the LLM as guidance for generating the output. To improve the quality of the extracted information, it is especially important to give clear definitions for your dataclasses, e.g.

Provide readable field names for your dataclasses.
Provide reasonable docstrings for your dataclasses.
For any optional fields, clearly annotate that they are optional, by SomeType | None or typing.Optional[SomeType].
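A dataclass that follows the three tips above might look like the sketch below. The Invoice class and all of its fields are hypothetical and chosen purely for illustration; any dataclass with readable names, a docstring, and explicit optional annotations serves the same purpose.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class Invoice:
    """
    A single invoice mentioned in the text.

    Attributes:
        invoice_number: The invoice identifier exactly as printed.
        total_amount: Grand total including tax, in the invoice's currency.
        due_date: Payment due date in ISO 8601 format, or None when the
            text does not state one.
    """

    invoice_number: str
    total_amount: float
    # Explicitly optional, so a missing value is allowed in the output.
    due_date: Optional[str] = None
```

Such a class would then be passed as output_type in the spec.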

Input data:

text (Str): The text to extract information from.

Return: As specified by the output_type field in the spec. The extracted information from the input text.

EmbedText

EmbedText embeds a text into a vector space using various LLM APIs that support text embedding.

The spec takes the following fields:

api_type (cocoindex.LlmApiType): The type of LLM API to use for embedding.
model (str): The name of the embedding model to use.
address (str, optional): The address of the LLM API. If not specified, uses the default address for the API type.
output_dimension (int, optional): The dimension to request from the embedding API. Some APIs support specifying the output dimension (e.g., OpenAI's models support dimension reduction). If not specified, the API will use its default dimension.
expected_output_dimension (int, optional): The expected dimension of the output embedding vector for validation and type schema. If not specified, falls back to output_dimension, then to the default dimension of the model.

For most API types, the function internally keeps a registry for the default output dimension of known models. You need to explicitly specify expected_output_dimension (or output_dimension) if you want to use a new model that is not in the registry yet.
task_type (str, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.
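The fallback order described for expected_output_dimension can be sketched as a plain function. This is an illustration only: the function name, the one-entry registry, and the choice to raise ValueError for an unknown model are assumptions, not the library's actual behavior.

```python
# Hypothetical stand-in for the internal registry of known models.
DEFAULT_DIMENSIONS = {"text-embedding-3-small": 1536}


def resolve_expected_dimension(model, output_dimension=None,
                               expected_output_dimension=None):
    """Pick the vector dimension used for validation and the type schema."""
    if expected_output_dimension is not None:
        return expected_output_dimension  # explicit value wins
    if output_dimension is not None:
        return output_dimension           # else the dimension requested from the API
    if model in DEFAULT_DIMENSIONS:
        return DEFAULT_DIMENSIONS[model]  # else the known default for the model
    # Assumption: an unknown model with no explicit dimension is an error.
    raise ValueError(
        f"Unknown model {model!r}: set expected_output_dimension or output_dimension"
    )
```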

Supported APIs for Text Embedding

Not all LLM APIs support text embedding. See the LLM API Types table for which APIs support text embedding functionality.

Input data:

text (Str): The text to embed.

Return: Vector[Float32, N], where N is the dimension of the embedding vector determined by the model.

ColPali Functions

ColPali functions enable multimodal document retrieval using ColVision models. These functions support all models available in the colpali-engine library, including:

ColPali models (colpali-*): PaliGemma-based, best for general document retrieval
ColQwen2 models (colqwen-*): Qwen2-VL-based, excellent for multilingual text (29+ languages) and general vision
ColSmol models (colsmol-*): Lightweight, good for resource-constrained environments
Any future ColVision models supported by colpali-engine

These models use late interaction between image patch embeddings and text token embeddings for retrieval.

Optional Dependency Required

These functions require the colpali-engine library, which is an optional dependency. Install CocoIndex with:

pip install 'cocoindex[colpali]'

ColPaliEmbedImage

ColPaliEmbedImage embeds images using ColVision multimodal models.

The spec takes the following fields:

model (str): Any ColVision model name supported by colpali-engine (e.g., "vidore/colpali-v1.2", "vidore/colqwen2.5-v0.2", "vidore/colsmol-v1.0"). See the complete list of supported models.

Input data:

img_bytes (Bytes): The image data in bytes format.

Return: Vector[Vector[Float32, N]], where N is the hidden dimension determined by the model. This returns a multi-vector format with variable patches and fixed hidden dimension.

ColPaliEmbedQuery

ColPaliEmbedQuery embeds text queries using ColVision multimodal models.

This produces query embeddings compatible with ColVision image embeddings for late interaction scoring (MaxSim).

The spec takes the following fields:

model (str): Any ColVision model name supported by colpali-engine (e.g., "vidore/colpali-v1.2", "vidore/colqwen2.5-v0.2", "vidore/colsmol-v1.0"). See the complete list of supported models.

Input data:

query (Str): The text query to embed.

Return: Vector[Vector[Float32, N]], where N is the hidden dimension determined by the model. This returns a multi-vector format with variable tokens and fixed hidden dimension.
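For readers unfamiliar with late interaction scoring, MaxSim compares two multi-vectors as follows: for each query token vector, take the highest dot product against any image patch vector, then sum those maxima. The standalone sketch below shows that computation on plain Python lists; the function name maxsim_score is hypothetical, and in practice this scoring is usually delegated to the vector database or to colpali-engine.

```python
def maxsim_score(query_vectors, image_vectors):
    """Sum, over query tokens, of the best dot product against any image patch."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(
        max(dot(query_vec, patch_vec) for patch_vec in image_vectors)
        for query_vec in query_vectors
    )
```

Both inputs must share the same hidden dimension N, while the number of query tokens and image patches may differ.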

ExtractCodeElements (Plus Only)

ExtractCodeElements extracts structural declarations and call references from source code using a TreeSitter AST parse. It identifies classes, functions, methods, and call sites, returning their names, qualified paths, namespace context, and source positions.

Understanding TreeSitter node kinds and fields

ExtractCodeElements uses TreeSitter to parse source code into an Abstract Syntax Tree (AST). Every node in the tree has a kind — the rule name from the grammar (e.g. "class_declaration", "function_definition") — and zero or more named fields that point to specific children (e.g. name, body, type).

The node kinds and field names you supply in the config below come directly from each language's grammar file:

C#: tree-sitter-c-sharp/grammar.js
Python: tree-sitter-python/grammar.js

In a grammar file, top-level rule names (e.g. class_declaration: $ => ...) become node kinds, and references like field('name', ...) or $.name define named fields. The built-in default configuration shown in the examples below is a good starting point for the most common patterns.

The spec takes the following fields:

languages (dict[str, CodeElementsLanguageConfig], optional): Per-language extraction configuration. The key is the language name (case-insensitive, e.g. "python", "csharp"). If omitted or None, all built-in defaults are used. If provided, only the languages listed in the map are enabled, and each language's node-kind configuration is replaced by the user-supplied values (the tree-sitter grammar and language-specific logic are always provided by the built-in defaults and cannot be overridden).

Each CodeElementsLanguageConfig has the following fields (all optional, default to empty):
- declaration_node_kinds (dict[str, CodeElementsDeclarationConfig]): AST node kinds treated as declarations (classes, functions, methods, etc.). The key is the TreeSitter node kind (the value returned by node.kind()). Each CodeElementsDeclarationConfig has:
  - name_field (str): The named child field that holds the declaration's identifier.
  - body_field (str, optional): The named child field that holds the body, used to determine has_body. If omitted, has_body is always false.
- reference_node_kinds (dict[str, CodeElementsReferenceConfig]): AST node kinds treated as call or type references (invocations, object creation, etc.). The key is the TreeSitter node kind. Each CodeElementsReferenceConfig has:
  - path_expr_field (str): The named child field that holds the path expression (e.g. "function" for a call node, "type" for a parameter annotation).
- type_list_node_kinds (dict[str, CodeElementsTypeListConfig]): AST node kinds whose named children are each emitted as a separate type reference (e.g. base class lists, generic type argument lists). The key is the TreeSitter node kind. CodeElementsTypeListConfig has no fields.
- namespace_node_kinds (dict[str, CodeElementsNamespaceConfig]): AST node kinds that introduce a namespace scope. The key is the TreeSitter node kind. Each CodeElementsNamespaceConfig has:
  - name_field (str): The named child field that holds the namespace name.
- exclude_reference_patterns (list[str], default: []): Regex patterns matched against referenced_full_path of each reference. Any reference whose full path matches is dropped. Each pattern is automatically anchored to match the full string (i.e. the engine wraps each pattern with ^(?:...)$), so you write r"[A-Z]" instead of r"^[A-Z]$". Patterns are |-joined and precompiled into a single regex at construction time. See regex syntax for supported syntax.
  
  Built-in defaults:
  - C#: Empty list — C# built-in types (int, string, bool, void, etc.) are automatically excluded because the TreeSitter C# grammar parses them as predefined_type nodes, which are distinct from user-defined identifiers.
  - Python: [r"int|str|float|bool|list|dict|set|tuple|bytes|complex|object|None|type"] — Python built-in types are regular identifier nodes in the grammar (indistinguishable from user types), so they are excluded via this default pattern.
  Example use cases:
  - Exclude everything under System: r"System\..*"
  - Exclude single-letter generic type parameters: r"[A-Z]"
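The anchoring and joining behavior of exclude_reference_patterns can be illustrated with Python's re module. This is a sketch under the assumption that Python's regex semantics are close enough for these simple patterns; the library uses its own regex engine, and the helper name make_exclusion_filter is hypothetical.

```python
import re


def make_exclusion_filter(patterns):
    """Return a predicate telling whether a referenced_full_path is dropped."""
    if not patterns:
        # An empty list excludes nothing.
        return lambda path: False
    # Anchor each pattern to the full string, then join them with "|".
    combined = re.compile("|".join(f"^(?:{p})$" for p in patterns))
    return lambda path: combined.search(path) is not None
```

Because of the anchoring, r"[A-Z]" drops a reference named T but keeps Task, and the Python default pattern drops int but keeps a user-defined name such as interval.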

Input data:

code (Str): The source code to analyze.
language (Str): The programming language. Supported values: csharp, python.
base_namespace (Optional[Str], default: None): The module/namespace prefix for the file. Used by languages that derive namespace from file path rather than source declarations (e.g. Python).

Return: Struct with two sub-fields:

declarations (LTable): Structural declarations found in the source, in document order. Each row has the following fields:
- namespace (Str): Namespace at the point of declaration (e.g. "MyApp.Services").
- entity_name (Str): Fully qualified name within its namespace (e.g. "OrderService.PlaceOrder").
- parent_entity_name (Str or Null): Entity name of the enclosing declaration, if any.
- base_name (Str): Simple (unqualified) name (e.g. "PlaceOrder").
- ast_node_kind (Str): TreeSitter node kind (e.g. "class_declaration", "method_declaration").
- has_body (Bool): Whether the declaration has a meaningful body (e.g. false for interface methods or abstract stubs).
- start / end (Struct): Start/end position of the declaration node, with sub-fields:
  - offset (Int64): Char offset. Starting from 0.
  - line (Int64): Line number. Starting from 1.
  - column (Int64): Column number. Starting from 1.
references (LTable): Call and object-creation references found in the source, in document order. Each row has the following fields:
- namespace (Str): Namespace at the point of the reference.
- parent_entity_name (Str or Null): Entity name of the enclosing declaration, if any.
- referenced_base_name (Str): Simple name of the referenced entity (e.g. "Process").
- referenced_full_path (Str): Full dotted path of the reference (e.g. "helper.Process").
- ast_node_kind (Str): TreeSitter node kind (e.g. "invocation_expression", "call").
- start / end (Struct): Start/end position with offset, line, column sub-fields (same as above).

Supported Languages

Language	Key
C#	`csharp`
Python	`python`

Examples

Using built-in defaults — omit languages to use the default configuration for all supported languages:

file["elements"] = file["content"].transform(
    cocoindex.functions.ExtractCodeElements(),
    language=file["language"],
)

Spelling out the default configuration explicitly — useful as a starting point for customization:

file["elements"] = file["content"].transform(
    cocoindex.functions.ExtractCodeElements(
        languages={
            "python": cocoindex.functions.CodeElementsLanguageConfig(
                declaration_node_kinds={
                    "class_definition": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                    "function_definition": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                },
                reference_node_kinds={
                    "call": cocoindex.functions.CodeElementsReferenceConfig(
                        path_expr_field="function"
                    ),
                    "typed_parameter": cocoindex.functions.CodeElementsReferenceConfig(
                        path_expr_field="type"
                    ),
                    "typed_default_parameter": cocoindex.functions.CodeElementsReferenceConfig(
                        path_expr_field="type"
                    ),
                },
                exclude_reference_patterns=[
                    r"int|str|float|bool|list|dict|set|tuple|bytes|complex|object|None|type",
                ],
            ),
            "csharp": cocoindex.functions.CodeElementsLanguageConfig(
                declaration_node_kinds={
                    "class_declaration": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                    "struct_declaration": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                    "interface_declaration": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                    "enum_declaration": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                    "record_declaration": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                    "method_declaration": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                    "constructor_declaration": cocoindex.functions.CodeElementsDeclarationConfig(
                        name_field="name", body_field="body"
                    ),
                },
                reference_node_kinds={
                    "invocation_expression": cocoindex.functions.CodeElementsReferenceConfig(
                        path_expr_field="function"
                    ),
                    "object_creation_expression": cocoindex.functions.CodeElementsReferenceConfig(
                        path_expr_field="type"
                    ),
                    "parameter": cocoindex.functions.CodeElementsReferenceConfig(
                        path_expr_field="type"
                    ),
                },
                type_list_node_kinds={
                    "base_list": cocoindex.functions.CodeElementsTypeListConfig(),
                    "type_argument_list": cocoindex.functions.CodeElementsTypeListConfig(),
                },
                namespace_node_kinds={
                    "namespace_declaration": cocoindex.functions.CodeElementsNamespaceConfig(
                        name_field="name"
                    ),
                },
            ),
        }
    ),
    language=file["language"],
)

Chonkie Functions (Plus Only)

Chonkie functions provide advanced text chunking capabilities using the Chonkie library. These functions offer various chunking strategies including recursive, semantic, neural, and code-aware chunking.

Optional Dependency Required

These functions require the chonkie library with appropriate extras, which is an optional dependency. Install as needed:

pip install "cocoindex[chonkie]"      # For all Chonkie functions (recommended)
# Or install individual extras:
pip install chonkie                   # For ChonkieRecursiveChunker
pip install "chonkie[code]"          # For ChonkieCodeChunker
pip install "chonkie[semantic]"      # For ChonkieSemanticChunker
pip install "chonkie[neural]"        # For ChonkieNeuralChunker

For more information, see the Chonkie documentation.

ChonkieRecursiveChunker

ChonkieRecursiveChunker chunks text recursively using Chonkie's RecursiveChunker.

The spec takes the following fields:

tokenizer (str, default: "character"): The tokenizer to use for chunking. Only string values are supported.
chunk_size (int, default: 2048): Maximum number of tokens per chunk.
min_characters_per_chunk (int, default: 24): Minimum number of characters per chunk.

Input data:

text (Str): The text to chunk.

Return: LTable, each row represents a chunk, with the following sub fields:

location (Range): The location of the chunk.
text (Str): The text of the chunk.
start / end (Struct): Details about the start position (inclusive) and end position (exclusive) of the chunk. They have the following sub fields:
- offset (Int64): The byte offset of the position.
- line (Int64): The line number of the position. Starting from 1.
- column (Int64): The column number of the position. Starting from 1.

ChonkieCodeChunker

ChonkieCodeChunker chunks code using Abstract Syntax Trees (ASTs) with Chonkie's CodeChunker.

The spec takes the following fields:

tokenizer (str, default: "character"): The tokenizer to use for chunking.
chunk_size (int, default: 2048): Maximum number of tokens per chunk.

Input data:

text (Str): The code text to chunk.
language (Str): The programming language of the code (e.g., "python", "javascript", "java").

Return: LTable, each row represents a chunk with the same structure as ChonkieRecursiveChunker.

ChonkieSemanticChunker

ChonkieSemanticChunker chunks text based on semantic similarity using Chonkie's SemanticChunker.

The spec takes the following fields:

embedding_model (str, default: "minishlab/potion-base-32M"): Model for semantic embeddings.
threshold (float, default: 0.8): Similarity threshold (0-1).
chunk_size (int, default: 2048): Maximum tokens per chunk.
similarity_window (int, default: 3): Sentences to consider for similarity.
min_sentences_per_chunk (int, default: 1): Minimum sentences per chunk.
skip_window (int, default: 0): Groups to skip when merging.
filter_window (int, default: 5): Savitzky-Golay filter window length.
filter_polyorder (int, default: 3): Polynomial order for filter.

Input data:

text (Str): The text to chunk.

Return: LTable, each row represents a chunk with the same structure as ChonkieRecursiveChunker.

ChonkieNeuralChunker

ChonkieNeuralChunker chunks text using a neural model to detect semantic shifts with Chonkie's NeuralChunker.

The spec takes the following fields:

model (str, default: "mirth/chonky_modernbert_base_1"): Fine-tuned BERT model for detecting semantic shifts.
device_map (str, default: "cpu"): Device for model inference (cpu, cuda, mps).
min_characters_per_chunk (int, default: 10): Minimum characters required for a valid chunk.

Input data:

text (Str): The text to chunk.

Return: LTable, each row represents a chunk with the same structure as ChonkieRecursiveChunker.
