Documentation

How CODEXT works

A complete technical description of what happens from the moment you drop a folder to the moment the .txt file lands on disk.


The bundling pipeline

When you drop a folder or click "Bundle", CODEXT runs five sequential phases. All processing is synchronous and local. No threads are spawned beyond the Rust runtime's internal pool.

Directory traversal

CODEXT walks the entire directory tree recursively using a depth-first traversal. Hidden files (dotfiles) are included unless explicitly excluded. Symlinks are followed one level deep to avoid infinite loops.
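A minimal sketch of such a walk, in Python rather than CODEXT's own Rust. The function name walk_tree and the max_symlink_depth parameter are illustrative assumptions, as is the reading of "one level deep" as "a symlink reached through another symlink is not followed".

```python
from pathlib import Path
from typing import Iterator


def walk_tree(root: Path, symlink_depth: int = 0,
              max_symlink_depth: int = 1) -> Iterator[Path]:
    """Yield every file under root, depth-first, dotfiles included."""
    for entry in sorted(root.iterdir(), key=lambda p: p.name):
        # Count how many symlinks were crossed to reach this entry.
        depth = symlink_depth + 1 if entry.is_symlink() else symlink_depth
        if depth > max_symlink_depth:
            continue  # symlink behind another symlink: skip to avoid loops
        if entry.is_dir():
            yield from walk_tree(entry, depth, max_symlink_depth)
        elif entry.is_file():
            yield entry  # broken symlinks are neither dir nor file, so skipped
```

Because the depth counter only grows when a symlink is crossed, a link that points back at its own ancestor is entered once and then ignored on the second encounter.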

Exclusion filtering

Each path is checked against four filters: (1) CODEXT default exclusions, (2) parsed .gitignore rules from the project root and all parent directories, (3) any custom rules in a .codextignore file, and (4) the per-file size cap. A file that matches an exclusion rule or exceeds the size cap is listed in the tree as [excluded], and its contents are omitted.
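The check chain can be pictured as a function that returns the first reason a file is excluded. This is a sketch under assumptions: FileInfo, exclusion_reason, and the two matcher callbacks are hypothetical names, the default set is only a subset of the documented list, and the 500KB cap is borrowed from the example output header.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Subset of the documented default exclusions, for illustration only.
DEFAULT_EXCLUDED_NAMES = {"node_modules", ".git", "dist", "build", "__pycache__"}


@dataclass
class FileInfo:
    rel_path: str    # POSIX-style path relative to the project root
    size_bytes: int


def exclusion_reason(info: FileInfo,
                     gitignore_match: Callable[[str], bool],
                     codextignore_match: Callable[[str], bool],
                     max_size_bytes: int = 500 * 1024) -> Optional[str]:
    """Return why a file is excluded, or None if it should be bundled."""
    if any(part in DEFAULT_EXCLUDED_NAMES for part in info.rel_path.split("/")):
        return "default"
    if gitignore_match(info.rel_path):
        return "gitignore"
    if codextignore_match(info.rel_path):
        return "codextignore"
    if info.size_bytes > max_size_bytes:
        return "size"
    return None
```

Any non-None result would correspond to an [excluded] entry in the tree.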

Binary file detection

Each surviving file is read into a byte buffer. CODEXT checks the first 8KB for null bytes and non-UTF-8 sequences. Files that fail this check are classified as binary: they appear in the file tree map with a [binary] label, but no content is included.

Token estimation

CODEXT estimates the total token count of the output before writing. It uses a character-to-token ratio derived from GPT-4's cl100k_base tokenizer (approximately 4 characters per token for English/code content). The estimate is shown in the UI and included in the output header.

Output assembly and write

The output is assembled in memory: header block → project map (text tree drawn with Unicode box-drawing characters) → file contents section (each file preceded by its path). The complete string is written atomically to the target .txt file.
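One common way to get an atomic write is to write a temporary file in the same directory and then rename it over the target. The sketch below assumes that approach; the source does not say which mechanism CODEXT uses, and assemble and write_atomically are hypothetical names.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def assemble(header: str, project_map: str,
             files: Iterable[Tuple[str, str]]) -> str:
    """Join header, project map, and per-file sections into one string."""
    parts = [header.rstrip("\n"), "", project_map.rstrip("\n"), ""]
    for rel_path, content in files:
        parts.append(f"[FILE: {rel_path}]")
        parts.append(content.rstrip("\n"))
        parts.append("")
    return "\n".join(parts)


def write_atomically(target: Path, text: str) -> None:
    """Write text to a temp file beside target, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent),
                                    prefix=target.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, target)  # readers never see a half-written file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The temporary file lives in the target's directory so that the final rename stays on one filesystem.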

Default exclusions

These directories and files are excluded whenever "Skip defaults" is enabled, which is the default setting. Each one can be re-included individually in settings if you need it.

node_modules/

npm/Yarn dependency tree. Never part of your source code.

.git/

Git object store and history. Unreadable binary objects.

dist/

Build output. Compiled/minified, not source.

build/

Build output directory used by many frameworks.

.next/

Next.js compiled output and cache.

.nuxt/

Nuxt.js compiled output.

out/

Common static export output directory.

.venv/

Python virtual environment. Third-party packages only.

__pycache__/

Python bytecode cache files.

target/

Build output for Cargo (Rust) and Maven (Java).

.gradle/

Gradle build cache and configuration.

Pods/

CocoaPods iOS dependency tree.

.DS_Store

macOS filesystem metadata. Not code.

Thumbs.db

Windows thumbnail cache. Not code.

.gitignore parsing

CODEXT reads and applies ignore rules from the five sources below, in order of precedence:

Gitignore resolution order

1. Project root .gitignore
2. Subdirectory .gitignore files (applied to their subtree)
3. Global gitignore (~/.gitignore_global) — if it exists
4. CODEXT default exclusions (always applied if "Skip defaults" is on)
5. .codextignore in project root (Pro only, highest precedence)

Negation patterns (lines starting with !) are supported. Glob patterns (**, ?, character classes) are fully supported.
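To illustrate how negation interacts with ordinary patterns, here is a deliberately simplified matcher in which the last matching rule wins. It is a sketch, not CODEXT's parser: parse_rules and is_ignored are hypothetical names, and Python's fnmatch lets * cross "/" boundaries, which real gitignore matching does not. Git's rule that a file cannot be re-included once its parent directory is excluded is also not modeled.

```python
import fnmatch
from typing import Iterable, List, Tuple

Rule = Tuple[str, bool, bool]  # (glob, is_negation, directory_only)


def parse_rules(lines: Iterable[str]) -> List[Rule]:
    """Turn ignore-file lines into rules, dropping blanks and comments."""
    rules: List[Rule] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        if negated:
            line = line[1:]
        rules.append((line.rstrip("/"), negated, line.endswith("/")))
    return rules


def _matches(glob: str, dir_only: bool, rel_path: str) -> bool:
    parts = rel_path.split("/")  # rel_path is assumed to name a file
    if "/" in glob:
        # Pattern with a slash: compare against the path and its ancestors.
        prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
        if dir_only:
            prefixes = prefixes[:-1]
        return any(fnmatch.fnmatchcase(p, glob.lstrip("/")) for p in prefixes)
    # Pattern without a slash: compare against each path component.
    names = parts[:-1] if dir_only else parts
    return any(fnmatch.fnmatchcase(n, glob) for n in names)


def is_ignored(rel_path: str, rules: List[Rule]) -> bool:
    """Apply rules in order; a later match overrides an earlier one."""
    ignored = False
    for glob, negated, dir_only in rules:
        if _matches(glob, dir_only, rel_path):
            ignored = not negated
    return ignored
```

With the rules *.log followed by !keep.log, a file named keep.log is first ignored and then re-included, while every other .log file stays ignored.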

Binary file detection

A file is classified as binary if any of the following are true after reading the first 8KB:

Binary classification rules

— Contains one or more null bytes (0x00)
— More than 30% of bytes are non-printable (outside 0x09–0x0D, 0x20–0x7E)
— File extension is in the known binary list:
  .png .jpg .jpeg .gif .webp .ico .svg (if binary-encoded)
  .exe .dll .so .dylib .bin
  .zip .tar .gz .bz2 .7z .rar
  .pdf .doc .docx .xls .xlsx
  .woff .woff2 .ttf .eot

Binary files appear in the project map with a [binary] label. Their file size is shown. Contents are never included.
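The three rules above translate fairly directly into code. The following is a sketch under assumptions: looks_binary and is_binary are hypothetical names, .svg is left out of the extension set because the source treats it conditionally, and an empty file is assumed to count as text.

```python
from pathlib import Path

SAMPLE_SIZE = 8 * 1024  # only the first 8KB is inspected

# Extensions from the known binary list (.svg omitted: it is conditional).
BINARY_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".ico",
    ".exe", ".dll", ".so", ".dylib", ".bin",
    ".zip", ".tar", ".gz", ".bz2", ".7z", ".rar",
    ".pdf", ".doc", ".docx", ".xls", ".xlsx",
    ".woff", ".woff2", ".ttf", ".eot",
}

# Printable bytes: 0x09-0x0D and 0x20-0x7E, as stated in the rules.
PRINTABLE = set(range(0x09, 0x0E)) | set(range(0x20, 0x7F))


def looks_binary(sample: bytes) -> bool:
    """Apply the null-byte and 30% non-printable rules to a byte sample."""
    if not sample:
        return False  # assumption: empty files are treated as text
    if b"\x00" in sample:
        return True
    non_printable = sum(1 for byte in sample if byte not in PRINTABLE)
    return non_printable / len(sample) > 0.30


def is_binary(path: Path) -> bool:
    """Classify a file by extension first, then by its first 8KB."""
    if path.suffix.lower() in BINARY_EXTENSIONS:
        return True
    with path.open("rb") as fh:
        return looks_binary(fh.read(SAMPLE_SIZE))
```

Note that under the stated byte ranges every byte of a multi-byte UTF-8 character counts as non-printable, so text written mostly in a non-Latin script could cross the 30% threshold in this sketch.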

Output format structure

The output is a plain UTF-8 encoded .txt file. It has three sections (header block, project map, and file contents) separated by horizontal dividers. The format is designed to be easy to read for both humans and LLMs.

Example output

CODEXT PROTOCOL: 1.2.0
Generated: 2026-04-15T14:32:11Z
Mode: full-content
Options: gitignore=true, skip_defaults=true, max_size=500KB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INFO
  Folder name   : my-project
  Full path     : /Users/dev/projects/my-project
  File count    : 84
  Folder count  : 12
  Token est.    : ~34,100 (GPT-4)

━━━ PROJECT MAP ━━━━━━━━━━━━━━━━━━━

📁 my-project/
├── 📁 src/
│   ├── 📁 components/
│   │   ├── 📄 Button.tsx
│   │   └── 📄 Modal.tsx
│   ├── 📄 index.ts
│   └── 📄 utils.ts
├── 📄 package.json
└── 📄 tsconfig.json

━━━ FILE CONTENTS ━━━━━━━━━━━━━━━━━

[FILE: src/index.ts]
// file content here
...

The file tree uses Unicode box-drawing characters (same as the tree command). File content sections use a consistent [FILE: path/to/file.ext] header that models can parse reliably.
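Because every file section starts with the same header, a bundle can be split back into files with a few lines of code. The sketch below assumes that a section runs until the next [FILE: ...] line and that the contents section starts at the divider containing "FILE CONTENTS"; split_bundle is a hypothetical helper, not part of CODEXT.

```python
import re
from typing import Dict, List, Optional

FILE_HEADER = re.compile(r"^\[FILE: (?P<path>.+)\]$")


def split_bundle(text: str) -> Dict[str, str]:
    """Map each bundled path to its content, based on [FILE: ...] headers."""
    files: Dict[str, List[str]] = {}
    current: Optional[str] = None
    in_contents = False
    for line in text.split("\n"):
        if not in_contents:
            # Skip the header block and project map entirely.
            in_contents = "FILE CONTENTS" in line
            continue
        match = FILE_HEADER.match(line)
        if match:
            current = match.group("path")
            files[current] = []
        elif current is not None:
            files[current].append(line)
    return {path: "\n".join(lines).strip("\n") for path, lines in files.items()}
```

A source file that itself contains a line shaped like [FILE: x] would be split incorrectly by this sketch.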

Token estimation

The token estimator runs before the output file is written. It approximates the number of tokens the output will consume in a model's context window. This helps you decide whether to split the bundle, add exclusions, or adjust the per-file size cap.

Estimation method

GPT-4 / Claude 3: ~4 characters per token (English + code average)
Estimate shown in UI before bundle completes
Estimate included in output header for reference
Actual token count varies ±15% depending on content type
Code-heavy projects tokenize at ~3.5 chars/token
Documentation-heavy projects at ~4.5 chars/token
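The ratios above reduce to a single division. The sketch below assumes the estimate is rounded up and that a content profile can be chosen by name; estimate_tokens, estimate_range, and the profile keys are hypothetical.

```python
import math
from typing import Tuple

# Characters per token, taken from the ratios listed above.
CHARS_PER_TOKEN = {"default": 4.0, "code": 3.5, "docs": 4.5}


def estimate_tokens(char_count: int, profile: str = "default") -> int:
    """Approximate the token count from a character count."""
    return math.ceil(char_count / CHARS_PER_TOKEN[profile])


def estimate_range(char_count: int, tolerance: float = 0.15) -> Tuple[int, int]:
    """Return a (low, high) band around the default estimate."""
    mid = estimate_tokens(char_count)
    return math.floor(mid * (1 - tolerance)), math.ceil(mid * (1 + tolerance))
```

For example, a bundle of 136,400 characters gives an estimate of 34,100 tokens at the default ratio, the figure shown in the example header.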
