loke.dev

Anatomy of a Ranked Search: Scaling SQLite FTS5 to Millions of Documents

Move beyond slow pattern matching and learn how to orchestrate virtual tables, shadow indexes, and BM25 ranking to build a production-grade search engine directly in your database.

7 min read

I spent weeks trying to optimize a "simple" search feature for a document management system. Like many developers, my first instinct was to lean on LIKE operators with wildcards. It worked fine with 1,000 rows. At 100,000 rows, it was sluggish. At a million rows, the database basically gave up, pegging the CPU while it churned through full table scans. I thought I was forced to migrate to a heavyweight like Elasticsearch or Meilisearch—adding massive infrastructure overhead for a project that was supposed to stay lean. Then I actually dug into SQLite’s FTS5 (Full-Text Search) extension. It wasn't just a "good enough" alternative; it was a revelation.

When you hit the limits of standard B-Tree indexing, you aren't just hitting a performance wall—you’re hitting a conceptual one. Relational databases are built to find exact matches or ranges. Search engines are built to find *relevance*.

Here is how you bridge that gap using SQLite FTS5 to handle millions of documents without breaking a sweat.

The Engine Under the Hood: The Inverted Index

Standard SQLite tables use B-Trees. If you're looking for a specific ID, B-Trees are lightning-fast. But if you're looking for the word "idiosyncratic" inside a 500-page PDF text blob, a B-Tree is useless: a LIKE with a leading wildcard can't use the index at all, so every row gets scanned.

FTS5 uses an Inverted Index. Imagine the index at the back of a massive textbook. Instead of listing pages and then their content, it lists words (tokens) and then the "page numbers" (rowids) where they appear.

When you create an FTS5 virtual table, SQLite doesn't just make one table. It creates several "shadow tables" (with names like t_data, t_idx, t_content) to manage this map. This is why you can’t just ALTER an FTS5 table; it’s a complex orchestration of data structures optimized for prefix and proximity lookups.
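You can watch this orchestration happen. Here's a minimal sketch using Python's built-in sqlite3 module (the table name `notes` is just for illustration; this assumes your SQLite build has FTS5 enabled, which is the default in most distributions, including Python's bundled library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")

# One virtual table, several real shadow tables created behind it
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
print([name for (name,) in rows])
# ['notes', 'notes_config', 'notes_content', 'notes_data', 'notes_docsize', 'notes_idx']
```

Those `_data` and `_idx` tables hold the serialized inverted index; `_content` is the duplicated document text, which brings us to the next point.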

Architecture: Content vs. Contentless vs. External Content

The biggest mistake developers make when scaling to millions of rows is using the default FTS5 setup. By default, FTS5 duplicates your data. If you have a 10GB documents table and you create an FTS5 table on top of it, you now have a roughly 20GB database: the original data, plus a full copy inside the FTS table, plus the index itself.

For millions of documents, you have three architectural choices:

1. Standard FTS5: Stores everything twice. High overhead, but easiest to use.
2. Contentless Tables (`content=''`): Stores only the index, not the original text. You can search, but you can't retrieve the original text from the FTS table or use snippets.
3. External Content Tables: The "Pro" move. You tell FTS5 to use your existing normal table as the source of truth.
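To make option 2 concrete, here's a quick sketch of what "you can't retrieve the original text" actually looks like (the `docs` table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# content='' keeps only the inverted index, never the text itself
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, content='')")
conn.execute("INSERT INTO docs(rowid, body) VALUES (1, 'hello full text search')")

# The index still answers queries...
hit = conn.execute("SELECT rowid FROM docs WHERE docs MATCH 'hello'").fetchone()
print(hit)   # (1,)

# ...but the original text is unrecoverable: columns read back as NULL
body = conn.execute("SELECT body FROM docs WHERE rowid = 1").fetchone()
print(body)  # (None,)
```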

Setting up an External Content Table

This is the most efficient way to scale. Your main table holds the data, and FTS5 holds the pointers.

-- The main table
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT,
    body TEXT,
    updated_at TIMESTAMP
);

-- The virtual search table
CREATE VIRTUAL TABLE documents_fts USING fts5(
    title,
    body,
    content='documents', -- Link to the main table
    content_rowid='id'   -- Map FTS rowid to our primary key
);

The catch? Since it's an external table, SQLite won't automatically update the index when you change the documents table. You have to manage the synchronization yourself using triggers.

-- Keep the index in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, body) VALUES (new.id, new.title, new.body);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, body) VALUES('delete', old.id, old.title, old.body);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, body) VALUES('delete', old.id, old.title, old.body);
  INSERT INTO documents_fts(rowid, title, body) VALUES (new.id, new.title, new.body);
END;
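Wiring the schema and triggers together, here's a sketch you can run to convince yourself the triggers keep the index honest (Python's built-in sqlite3, in-memory database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT,
    body TEXT
);
CREATE VIRTUAL TABLE documents_fts USING fts5(
    title, body, content='documents', content_rowid='id'
);
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, body)
  VALUES (new.id, new.title, new.body);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, body)
  VALUES ('delete', old.id, old.title, old.body);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, body)
  VALUES ('delete', old.id, old.title, old.body);
  INSERT INTO documents_fts(rowid, title, body)
  VALUES (new.id, new.title, new.body);
END;
""")

conn.execute("INSERT INTO documents(id, title, body) VALUES (1, 'Intro', 'sqlite basics')")
conn.execute("UPDATE documents SET body = 'fts5 internals' WHERE id = 1")

# The stale tokens are gone; the new ones are searchable
stale = conn.execute(
    "SELECT rowid FROM documents_fts WHERE documents_fts MATCH 'basics'"
).fetchall()
fresh = conn.execute(
    "SELECT rowid FROM documents_fts WHERE documents_fts MATCH 'internals'"
).fetchall()
print(stale, fresh)  # [] [(1,)]
```

Note the special 'delete' command in the first column: for external content tables you must tell FTS5 what the old row looked like, because it no longer stores the text needed to find the entries to remove.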

The Magic of BM25 Ranking

Search is useless if the most relevant result is on page 10. In FTS5, the bm25() function is your best friend. BM25 (Best Matching 25) is a ranking function that estimates the relevance of a document to a given search query.

It’s smarter than simple word counting. It considers:
1. Term Frequency: How many times does the word appear in this document?
2. Inverse Document Frequency: If the word "the" appears a lot, it’s not important. If "Oppenheimer" appears, it’s highly significant.
3. Field Weighting: You can make a match in the title worth more than a match in the body.

Writing a Ranked Query

Here is how you actually execute a production-grade search. Notice the hidden rank column, which orders results by the default bm25() score, and the join back to the source table for display data.

SELECT 
    d.title, 
    snippet(documents_fts, 1, '<b>', '</b>', '...', 10) as highlight,
    rank
FROM documents_fts f
JOIN documents d ON f.rowid = d.id
WHERE documents_fts MATCH 'sqlite performance'
ORDER BY rank -- Smallest (most negative) values are most relevant
LIMIT 20;

Pro tip: In FTS5, bm25() returns a negative value where smaller numbers indicate higher relevance. If you want to weight the title column as 10x more important than the body, you use:
ORDER BY bm25(documents_fts, 10.0, 1.0)
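Here's a sketch of that weighting in action. The `articles` table and its rows are invented for illustration; the filler rows just keep the term "sqlite" rare enough in the corpus for the IDF component to count in its favor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany(
    "INSERT INTO articles(title, body) VALUES (?, ?)",
    [
        ("Cooking basics", "sqlite is mentioned once, in passing"),
        ("sqlite deep dive", "a guide to query planning"),
        ("Gardening", "tomatoes and soil"),
        ("Travel notes", "three days in Lisbon"),
        ("Knitting", "socks are hard"),
    ],
)

# Title matches count 10x over body matches; bm25() is negative,
# so ascending order puts the best match first
rows = conn.execute(
    "SELECT title FROM articles WHERE articles MATCH 'sqlite' "
    "ORDER BY bm25(articles, 10.0, 1.0)"
).fetchall()
print(rows)  # [('sqlite deep dive',), ('Cooking basics',)]
```

The weights are positional: the first number applies to the first declared column (title), the second to body.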

Tokenizers: Why "Running" should match "Run"

If your users search for "dancing," and your document contains "dance," a standard search fails. This is where tokenizers come in.

SQLite comes with a few built-in:
- unicode61: The default. Handles Unicode characters and case-folding.
- porter: Implements the Porter Stemming algorithm (e.g., "running" becomes "run").

For a better user experience at scale, use the porter tokenizer wrapped around unicode61.

CREATE VIRTUAL TABLE documents_fts USING fts5(
    title, 
    body,
    tokenize='porter unicode61'
);
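A quick check that stemming behaves as advertised (the `posts` table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE posts USING fts5(body, tokenize='porter unicode61')"
)
conn.execute("INSERT INTO posts(body) VALUES ('She was dancing all night')")

# 'dance' and 'dancing' both stem to 'danc', so this matches
hits = conn.execute("SELECT rowid FROM posts WHERE posts MATCH 'dance'").fetchall()
print(hits)  # [(1,)]
```

One caveat: stemming is baked into the index at write time, so changing the tokenizer later means rebuilding the whole FTS table.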

Scaling to Millions: Performance Tuning

When you cross the million-document threshold, disk I/O and index size become your primary bottlenecks.

1. Prefix Indexes

If your users expect autocomplete (searching sql and finding sqlite), FTS5 normally has to scan a lot of the index. You can speed this up by pre-calculating prefix indexes.

-- Create an index that optimizes for 2 and 3 character prefixes
CREATE VIRTUAL TABLE documents_fts USING fts5(
    title, body,
    prefix='2 3'
);
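In use, a trailing * turns a term into a prefix query that can be answered straight from those pre-built indexes. A runnable sketch (the table and row are illustrative; note that FTS5 takes the prefix lengths as a space-separated list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Pre-build indexes of 2- and 3-character prefixes
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, prefix='2 3')")
conn.execute("INSERT INTO docs(body) VALUES ('sqlite scales surprisingly well')")

# The trailing * makes this a prefix query
hits = conn.execute("SELECT rowid FROM docs WHERE docs MATCH 'sql*'").fetchall()
print(hits)  # [(1,)]
```

Each prefix length costs extra index space, so only declare the lengths your autocomplete UI actually queries.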

2. The optimize Command

Over time, as you insert and delete documents, the FTS index becomes fragmented. It’s like a messy room. The optimize command is like a deep clean—it merges the internal B-Tree structures of the virtual table into a single large B-Tree.

Run this during low-traffic periods:

INSERT INTO documents_fts(documents_fts) VALUES('optimize');

3. Page Size and Cache

SQLite’s default page size is often 4KB. For large FTS indexes, increasing this to 8KB or 16KB can significantly reduce the number of I/O operations required for a search.

PRAGMA page_size = 16384;
VACUUM;
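Order matters here: on a database that already contains data, the new page size only takes effect once VACUUM rewrites the file. A sketch using a throwaway on-disk database:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "search.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit, so VACUUM can run
conn.execute("CREATE TABLE t (x TEXT)")
conn.execute("INSERT INTO t VALUES ('some existing data')")
size_before = conn.execute("PRAGMA page_size").fetchone()
print(size_before)  # typically (4096,) on modern SQLite builds

conn.execute("PRAGMA page_size = 16384")
conn.execute("VACUUM")  # rewrites every page at the new size
size_after = conn.execute("PRAGMA page_size").fetchone()
print(size_after)  # (16384,)
```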

Dealing with the "Big Text" Gotcha

If you are indexing millions of rows, the index size can explode. I once saw an FTS index that was 3x the size of the original data because of high-cardinality tokens (like unique hashes or serial numbers indexed as text).

Rule of thumb: Only index what the user actually searches. Don't index GUIDs, URLs, or metadata timestamps in FTS5. Keep those in your standard table and filter them via a JOIN.
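In practice that split looks like this: the MATCH clause runs against the FTS index, while metadata predicates run against ordinary columns through the JOIN. A sketch with an invented schema and dates (the FTS table is populated manually here for brevity; in production the triggers above handle it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    body TEXT,
    updated_at TEXT
);
CREATE VIRTUAL TABLE documents_fts USING fts5(
    body, content='documents', content_rowid='id'
);
""")
docs = [
    (1, "notes on tuning sqlite indexes", "2024-01-10"),
    (2, "notes on tuning sqlite pragmas", "2022-06-01"),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", docs)
conn.executemany(
    "INSERT INTO documents_fts(rowid, body) VALUES (?, ?)",
    [(i, b) for i, b, _ in docs],
)

# Text relevance from the FTS index, metadata filtering from the real table
rows = conn.execute("""
    SELECT d.id
    FROM documents_fts
    JOIN documents d ON d.id = documents_fts.rowid
    WHERE documents_fts MATCH 'tuning'
      AND d.updated_at >= '2023-01-01'
""").fetchall()
print(rows)  # [(1,)]
```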

The Multi-Term Trap

Users don't search like computers. They type sqlite search slow. FTS5 expects a specific syntax. If you pass that string directly to a MATCH clause, it might fail if the user includes characters like - or :.

Always sanitize and format your input. In your application layer (Python/Node/Go), convert user input into a format FTS5 likes:

# Simple Python example to sanitize FTS5 input
def sanitize_fts_query(user_input):
    # Replace anything that isn't alphanumeric with a space, then
    # wrap each word in * for prefix matching
    clean_input = "".join(c if c.isalnum() else " " for c in user_input)
    words = clean_input.split()
    if not words:
        return None  # nothing searchable -- skip the MATCH clause entirely
    return " AND ".join(f"{word}*" for word in words)

# "sqlite search" becomes "sqlite* AND search*"

When SQLite Isn't Enough

I love SQLite, but I'm not a zealot. If you need these features, it’s time to move to something like Meilisearch or Solr:
- Fuzzy Matching: FTS5 doesn't do Levenshtein distance (e.g., "sqllite" matching "sqlite") out of the box without a lot of hacking.
- Distributed Search: If your index is so large it won't fit on one NVMe drive, SQLite is no longer the tool.
- Aggressive Highlighting: While FTS5 has a snippet() function, doing heavy regex-based highlighting across millions of matches can be slow.

The Verdict

For 90% of applications, a properly configured SQLite FTS5 setup is faster and easier to maintain than a separate search cluster. By using external content tables, the Porter tokenizer, and BM25 ranking, you can deliver a world-class search experience directly from your local database file.

Don't let the "lite" in SQLite fool you. It handles millions of documents just fine; you just have to know which levers to pull. Stop writing LIKE queries and start building a real engine.