Fairness Considerations¶

How we ensure fair, unbiased comparisons between graph database engines.

Core Principles¶

The benchmarks follow four fairness principles:

Idiomatic Queries - Each engine uses its best-practice patterns
Equal Treatment - All engines measured the same way
Transparent Reporting - All methodology is public
Continuous Improvement - Results updated with feedback
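
The "Equal Treatment" principle is easiest to enforce when every engine adapter runs through one shared timing loop. The following is a minimal sketch of such a loop, not the benchmark's actual harness: the function name `measure`, the `warmup` option, and the synchronous callback are all assumptions made for illustration.

```javascript
// Hypothetical shared timing loop: every engine adapter is timed the same way.
// `fn` is assumed to be a synchronous function that runs one query.
function measure(fn, { warmup = 3, runs = 100 } = {}) {
  // Warm-up iterations are discarded so JIT and cache effects do not
  // penalize whichever engine happens to run first.
  for (let i = 0; i < warmup; i++) fn();

  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start); // elapsed milliseconds
  }
  return samples;
}
```

Passing each engine's query as `fn` keeps warm-up, run count, and clock source identical across engines, so only the query itself differs.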

Engine-Specific Optimizations¶

Query Patterns¶

Each engine uses queries optimized for that platform:

CongraphDB (Cypher)¶

// Variable-length path with inline property filter
MATCH path = (p:Paper {id: $id})-[:CITES*1..$hops]->(end:Paper)
RETURN count(DISTINCT end)

Neo4j (Cypher + GDS)¶

// Using the GDS library when available
CALL gds.pageRank.stream('my-graph', {
  maxIterations: 10,
  dampingFactor: 0.85
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id, score
ORDER BY score DESC

SQLite (Recursive CTE)¶

-- Optimized recursive query with indexes
WITH RECURSIVE traversal AS (
  SELECT target_id, 1 as depth, source_id
  FROM edges
  WHERE source_id = ? AND label = 'CITES'
  UNION ALL
  SELECT e.target_id, t.depth + 1, e.source_id
  FROM edges e
  JOIN traversal t ON e.source_id = t.target_id
  WHERE t.depth < ? AND e.label = 'CITES'
)
SELECT COUNT(DISTINCT target_id) FROM traversal;

Graphology (JavaScript API)¶

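// Using bfsFromNode from the graphology-traversal package
import {bfsFromNode} from 'graphology-traversal';

let count = 0;
// maxHops is the traversal depth limit (same role as $hops above)
bfsFromNode(graph, startId, (node, attr, depth) => {
  if (depth > 0) count++;    // skip the start node itself
  return depth >= maxHops;   // returning true stops expansion below this node
});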
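
Idiomatic queries only compare fairly if they compute the same answer. One way to check this is a small engine-independent reference implementation whose result every engine's k-hop query must match. The sketch below is an assumption about how such a check could look, not part of the benchmark code: `kHopCount` and the `[source, target]` edge-list format are hypothetical.

```javascript
// Hypothetical reference implementation: counts distinct nodes reachable
// from `startId` by following 1..maxHops directed edges.
function kHopCount(edges, startId, maxHops) {
  // Build an adjacency list from [source, target] pairs.
  const adjacency = new Map();
  for (const [source, target] of edges) {
    if (!adjacency.has(source)) adjacency.set(source, []);
    adjacency.get(source).push(target);
  }

  // The start node is not pre-marked as reached, so it is counted only
  // if a cycle leads back to it within maxHops.
  const reached = new Set();
  let frontier = [startId];
  for (let depth = 0; depth < maxHops && frontier.length > 0; depth++) {
    const next = [];
    for (const node of frontier) {
      for (const neighbor of adjacency.get(node) ?? []) {
        if (!reached.has(neighbor)) {
          reached.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return reached.size;
}
```

Running this on a small sample of the dataset and comparing against each engine's count can catch semantic mismatches, such as one query including the start node and another excluding it.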

Configuration Choices¶

Memory Settings¶

Each engine is configured for a fair comparison:

Engine	Memory Setting	Rationale
CongraphDB	Default (auto)	Embedded, automatic management
Neo4j	1GB heap + 1GB pagecache	Standard for this dataset size
Graphology	Node.js default (1.5GB)	No custom configuration available
SQLite	Default	Minimal overhead

Index Strategy¶

All engines use similar indexing:

-- CongraphDB (automatic)
CREATE INDEX ON :Paper(id);

-- Neo4j (explicit)
CREATE INDEX paper_id FOR (p:Paper) ON (p.id);

-- SQLite (manual)
CREATE INDEX idx_edges_source ON edges(source_id);
CREATE INDEX idx_edges_target ON edges(target_id);

Known Biases¶

Architectural Advantages¶

CongraphDB Advantages¶

Embedded - No network overhead
Rust Native - Compiled performance
Schema-Less - No migration overhead

Neo4j Advantages¶

Mature Optimizer - Advanced query planning
GDS Library - Native algorithm implementations
Caching - Page cache for hot data

Graphology Advantages¶

Pure JavaScript - No FFI overhead
In-Memory - No disk I/O for reads

SQLite Advantages¶

Battle-Tested - Decades of optimization
B-Trees - Efficient indexes

Mitigation Strategies¶

What We Do¶

Document Assumptions - All choices are explained
Invite Review - Open to feedback on methodology
Update Results - Incorporate community suggestions
Multiple Scenarios - Test different workload types

What We Don't Do¶

❌ Tune one engine extensively while leaving others default
❌ Use unrealistic queries that favor one engine
❌ Compare engines in inappropriate scenarios
❌ Hide methodology or cherry-pick results

Workload Selection¶

Why Citation Network?¶

Citation networks represent a realistic graph workload because:

Scale: Thousands to millions of nodes
Patterns: Power law degree distribution (like many real graphs)
Queries: k-hop traversals (a core graph operation)
Algorithms: PageRank (fundamental graph algorithm)
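
The power-law claim can be checked on any dataset by tabulating how many papers receive each number of citations. The sketch below assumes a `[source, target]` edge list and a hypothetical helper name `inDegreeHistogram`; neither comes from the benchmark code.

```javascript
// Hypothetical helper: maps each in-degree value to the number of nodes
// that have it. Nodes with no incoming edges are omitted.
function inDegreeHistogram(edges) {
  const inDegree = new Map();
  for (const [, target] of edges) {
    inDegree.set(target, (inDegree.get(target) ?? 0) + 1);
  }

  const histogram = new Map();
  for (const degree of inDegree.values()) {
    histogram.set(degree, (histogram.get(degree) ?? 0) + 1);
  }
  return histogram;
}
```

In a heavy-tailed network, most nodes fall in the low-degree buckets while a few nodes have very high in-degree.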

Alternative Workloads¶

We're considering additional workloads for future benchmarks:

Social Network: Friend recommendations
Knowledge Graph: Entity relationships
Fraud Detection: Transaction patterns
Supply Chain: Dependency networks

Data Freshness¶

Version Information¶

All results are tagged with:

Commit Hash: Exact source code version
Timestamp: When benchmark was run
Environment: Hardware and software specs
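
As an illustration of how such tags could be attached to a result file, here is a minimal sketch. The function name `buildRunMetadata` and the field names are assumptions, and the environment section only covers what Node.js exposes through `process`; the benchmark may record more detailed hardware specs.

```javascript
// Hypothetical metadata builder. The commit hash is passed in by the caller
// (for example, the output of `git rev-parse HEAD`).
function buildRunMetadata(commitHash, now = new Date()) {
  return {
    commitHash,
    timestamp: now.toISOString(), // UTC, ISO 8601
    environment: {
      node: process.version,      // e.g. the Node.js release in use
      platform: process.platform, // operating system identifier
      arch: process.arch,         // CPU architecture
    },
  };
}
```

Storing this object next to the raw timings lets a reader check out the same commit and compare results on similar hardware.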

Reproducibility¶

# Reproduce exact results
git clone https://github.com/congraph-ai/congraphdb-benchmark.git
cd congraphdb-benchmark
git checkout <commit-hash>
npm install
npm run benchmark

Reporting Standards¶

Statistical Significance¶

We report:

Median - Less sensitive to outliers than the mean
Runs - Minimum 3 runs per configuration, except where noted under Outlier Handling
Confidence Intervals - 95% CI where applicable
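
For readers who want to recompute these statistics from raw timings, the following sketch shows one way to do it. It is an assumption, not the benchmark's own code: the helper names are hypothetical, and the interval uses a normal approximation around the mean, which is only rough for small samples such as 3 runs.

```javascript
// Median of a list of timings; the input array is not modified.
function median(samples) {
  if (samples.length === 0) throw new Error('median of empty sample');
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// 95% confidence interval for the mean, using the normal approximation
// (mean ± 1.96 × standard error). Requires at least two samples.
function meanCI95(samples) {
  const n = samples.length;
  if (n < 2) throw new Error('need at least two samples');
  const mean = samples.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return { mean, low: mean - halfWidth, high: mean + halfWidth };
}
```

With very few runs, a t-distribution multiplier or a bootstrap interval would be a more defensible choice than the fixed 1.96 factor.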

Outlier Handling¶

Traversal: Median of 100 runs (robust to outliers)
Ingestion: Single run (representative of real use)
PageRank: Single run (deterministic algorithm)

Conflict of Interest¶

Disclosure¶

This benchmark is maintained by the CongraphDB team. To ensure fairness:

Open Source - All code is publicly available
Community Review - Issues and PRs welcome
Independent Verification - Anyone can run benchmarks
Competitive Inclusion - All major engines tested

Improvement Process¶

If you believe results are unfair:

Open an Issue - Describe the concern
Provide Evidence - Show the bias
Suggest Fix - Propose improvement
We'll Review - We'll test and update

Future Improvements¶

Planned Enhancements¶

Community Contributions¶

We welcome:

New engine adapters
Additional benchmark scenarios
Performance analysis
Documentation improvements
