
GitHub (Plus Only)

The GitHub source imports files from a GitHub repository using a GitHub App for authentication.

Setup for GitHub

Create and configure a GitHub App

You need a GitHub App with access to the target repository:

  1. In GitHub, go to Settings → Developer settings → GitHub Apps → New GitHub App.
  2. Grant the following permission:
    • Repository permissions → Read-only access to Contents and Metadata
  3. Generate and download a private key (PEM). Save it to a secure path on your machine.
  4. Note the App ID.
  5. Install the App on the repository you want to index.

You will pass the App ID and the private key to the source spec (see below).
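Before wiring the key into the spec, it can help to confirm that the downloaded file really is a PEM private key and not, say, a certificate. The following is a minimal sketch using only the Python standard library. The helper name `looks_like_private_key_pem` is hypothetical and not part of CocoIndex. It only checks the PEM armor lines and the base64 body, and does not validate the key itself.

```python
import re

# PEM armor for a private key, e.g. "-----BEGIN RSA PRIVATE KEY-----".
# The body may only contain base64 characters and newlines.
_PEM_PRIVATE_KEY = re.compile(
    r"\A-----BEGIN ([A-Z ]*PRIVATE KEY)-----\n"
    r"[A-Za-z0-9+/=\n]+"
    r"-----END \1-----\Z"
)


def looks_like_private_key_pem(text: str) -> bool:
    """Return True if `text` has the outer shape of a PEM private key.

    Hypothetical helper for illustration: it checks the format only and
    says nothing about whether GitHub will accept the key.
    """
    normalized = text.replace("\r\n", "\n").strip()
    return _PEM_PRIVATE_KEY.match(normalized) is not None
```

A typical use would be to read the file at your key path and call the helper on its contents before constructing the source.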

Spec

The spec takes the following fields:

  • app (GitHubApp): GitHub App credentials with fields:
    • app_id (int): the GitHub App ID
    • private_key_pem (cocoindex.TransientAuthEntryReference[str], optional): PEM content stored in the auth registry.
    • private_key_path (str, optional): filesystem path to the PEM private key file. Used only if private_key_pem is not provided.
    • rate_limit (cocoindex.RateLimit, optional): rate limit for the GitHub App. The quota is shared by all sources that use the same app, so specify the same value in each of them. If different values are provided, CocoIndex picks one of them arbitrarily. See Rate Limit for details.
  • owner (str): repository owner (user or organization)
  • repo (str): repository name
  • path (str, optional): limit to a subdirectory within the repository. If set, all files are addressed relative to this prefix.
  • git_ref (str, optional): branch name, tag, or commit SHA to read, e.g. "main", "v1.0.0". If omitted, the repository's default branch is used.
  • included_patterns (list[str], optional): glob patterns to include, e.g. ["*.rs", "docs/**/*.md"]. If not specified, all files are included.
  • excluded_patterns (list[str], optional): glob patterns to exclude, e.g. ["**/node_modules", "target"]. Exclusions take precedence over inclusions.
  • max_file_size (int, optional): maximum file size in bytes. Files larger than this will be skipped (treated as non-existent). If not specified, no size limit is enforced.
  • api_base_url (str, optional): custom GitHub API base URL. Use this for GitHub Enterprise Server instances. If not specified, defaults to GitHub.com ("https://api.github.com").
Info: included_patterns and excluded_patterns use Unix-style glob syntax. See globset syntax for details. Patterns are evaluated relative to path when provided, otherwise relative to the repository root.
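The filtering rules in the spec (inclusion patterns, exclusions taking precedence, and the size limit) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CocoIndex's implementation: the function name `is_selected` is hypothetical, the standard-library `fnmatch` module only approximates globset syntax (for example, it does not let `**/` match zero directories), and testing exclusion patterns against ancestor directories is an assumption about how directory patterns such as "target" apply.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Optional, Sequence


def is_selected(
    rel_path: str,
    size: int,
    included_patterns: Optional[Sequence[str]] = None,
    excluded_patterns: Optional[Sequence[str]] = None,
    max_file_size: Optional[int] = None,
) -> bool:
    """Decide whether a file (path relative to `path` or the repo root) is kept."""
    path = PurePosixPath(rel_path)
    # Assumption: an exclusion matches if it matches the file or any ancestor directory.
    ancestors = [str(p) for p in path.parents if str(p) != "."]
    candidates = [str(path), *ancestors]

    # Exclusions take precedence over inclusions.
    if excluded_patterns and any(
        fnmatchcase(c, pat) for c in candidates for pat in excluded_patterns
    ):
        return False

    # Files strictly larger than the limit are skipped.
    if max_file_size is not None and size > max_file_size:
        return False

    # No inclusion patterns means every remaining file is included.
    if included_patterns is None:
        return True
    return any(fnmatchcase(str(path), pat) for pat in included_patterns)
```

With included_patterns=["*.rs"] and excluded_patterns=["target"], a file such as "target/debug/build.rs" is rejected even though it matches the inclusion pattern, which is the precedence rule described in the field list.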

GitHub Enterprise Server

To use this source with GitHub Enterprise Server, set api_base_url to your server's API endpoint:

cocoindex.sources.GitHub(
    app=cocoindex.sources.GitHubApp(
        app_id=12345,
        private_key_pem=cocoindex.add_transient_auth_entry("... (your private key in PEM format)"),
        # or
        # private_key_path="path/to/key.pem",
    ),
    owner="mycompany",
    repo="myrepo",
    api_base_url="https://github.company.com/api/v3",
)
Note: This source supports incremental updates by diffing the repository head against the previously recorded watermark. There is no real-time change stream yet, so set a suitable refresh_interval on the flow for periodic updates.

Schema

The output is a KTable with the following sub-fields:

  • filename (Str, key): the file path within the repository, relative to path (if set) or the repository root, e.g. "src/lib.rs".
  • content (Str): the file content as UTF-8 text.
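To make the filename key concrete, here is a small sketch of how a repository path maps to the key when path is set. The helper `output_filename` and the directory "crates/core" are hypothetical and used only for illustration. The sketch assumes that files outside the configured path are simply not part of the output.

```python
from pathlib import PurePosixPath
from typing import Optional


def output_filename(repo_path: str, path: Optional[str] = None) -> Optional[str]:
    """Return the key a file would get, or None if it lies outside `path`.

    `repo_path` is the file's path from the repository root; `path` mirrors
    the optional `path` field of the spec.
    """
    full = PurePosixPath(repo_path)
    if path is None:
        # No prefix configured: keys are relative to the repository root.
        return str(full)
    try:
        return str(full.relative_to(PurePosixPath(path)))
    except ValueError:
        # The file is not under the configured subdirectory.
        return None
```

For example, with path="crates/core", the file "crates/core/src/lib.rs" would be keyed as "src/lib.rs", matching the example given for the filename field.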

Example

You can find an end-to-end example using the GitHub source at: