
Dataset Specification

Detailed information about the synthetic citation network dataset used in benchmarks.


Overview

The benchmark uses a synthetic academic citation network that mimics patterns found in real-world citation datasets such as:
- Cora: machine learning papers
- CiteSeer: scientific publications
- PubMed: biomedical literature
- DBLP: computer science bibliography


Data Model

Node Schema (Papers)

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| id | String | Unique identifier | "paper_000001" |
| title | String | Paper title | "Attention Is All You Need" |
| year | Integer | Publication year | 2017 |
| venue | String | Conference/Journal | "NeurIPS" |
| authors | Array | Author names | ["Author One", "Author Two"] |

Edge Schema (Citations)

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| source | String | Citing paper ID | "paper_000001" |
| target | String | Cited paper ID | "paper_000042" |
| label | String | Edge type | "CITES" |

Generation Algorithm

Node Generation

function generateNodes(count) {
  const venues = powerLawDistribution([
    { name: 'NeurIPS', weight: 100 },
    { name: 'ICML', weight: 80 },
    { name: 'ACL', weight: 60 },
    { name: 'EMNLP', weight: 50 },
    // ... 50 total venues
  ]);

  const nodes = [];
  for (let i = 0; i < count; i++) {
    nodes.push({
      id: `paper_${String(i).padStart(6, '0')}`,
      title: generateTitle(),
      year: randomInt(1990, 2024),
      venue: weightedRandom(venues),
      authors: randomArray(1, 5, () => generateAuthor())
    });
  }
  return nodes;
}
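
The helpers called above (`powerLawDistribution`, `weightedRandom`, `randomInt`, and so on) are not shown on this page. As a minimal sketch of what weighted venue selection could look like, assuming `weightedRandom` receives an array of `{ name, weight }` objects and returns a name, the following uses cumulative-weight sampling. The signature and the injectable `rng` parameter are assumptions made for illustration, not the benchmark's actual code.

```javascript
// Hypothetical helper: returns one name with probability proportional to its weight.
// `rng` is injectable so the selection can be made deterministic in tests.
function weightedRandom(items, rng = Math.random) {
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  let threshold = rng() * total;
  for (const item of items) {
    threshold -= item.weight;
    if (threshold < 0) return item.name;
  }
  // Floating-point fallback: return the last item.
  return items[items.length - 1].name;
}

// Weights copied from the venue list above (only the four shown there).
const sampleVenues = [
  { name: 'NeurIPS', weight: 100 },
  { name: 'ICML', weight: 80 },
  { name: 'ACL', weight: 60 },
  { name: 'EMNLP', weight: 50 },
];

console.log(weightedRandom(sampleVenues));
```

With these weights, NeurIPS is drawn roughly twice as often as EMNLP, which is the skew the venue list is meant to produce.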

Edge Generation (Preferential Attachment)

function generateEdges(nodes, edgesPerNode) {
  // Adjacency map: citing paper ID -> set of cited paper IDs
  const graph = new Map();
  nodes.forEach(n => graph.set(n.id, new Set()));

  const edges = [];
  for (const node of nodes) {
    // Preferential attachment: cite popular papers more often
    const targets = preferentialSelection(
      nodes,
      edgesPerNode,
      graph
    );

    const cited = graph.get(node.id);
    for (const target of targets) {
      // Skip self-citations and duplicate citations; a node may therefore
      // emit fewer than edgesPerNode edges
      if (target.id !== node.id && !cited.has(target.id)) {
        edges.push({
          source: node.id,
          target: target.id,
          label: 'CITES'
        });
        cited.add(target.id);
      }
    }
  }
  return edges;
}
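
`preferentialSelection` is called above but not defined on this page. The sketch below shows one way such a helper might work, assuming the weight of each candidate is its current number of incoming citations plus one and that targets are drawn without replacement. The fourth `rng` parameter and the "+ 1" smoothing are assumptions for illustration. Recomputing in-degrees on every call is O(E), so a real generator would more likely maintain the counts incrementally.

```javascript
// Hypothetical sketch; the benchmark's real preferentialSelection is not shown here.
function preferentialSelection(nodes, count, graph, rng = Math.random) {
  // In-degree = citations received so far, derived from the adjacency map
  // (citing paper ID -> Set of cited paper IDs).
  const inDegree = new Map(nodes.map(n => [n.id, 0]));
  for (const cited of graph.values()) {
    for (const id of cited) inDegree.set(id, inDegree.get(id) + 1);
  }

  // "+ 1" smoothing lets papers with no citations be selected at all.
  const weights = nodes.map(n => inDegree.get(n.id) + 1);
  const chosen = [];
  const limit = Math.min(count, nodes.length);

  while (chosen.length < limit) {
    const total = weights.reduce((a, b) => a + b, 0);
    let threshold = rng() * total;
    let pick = -1;
    for (let i = 0; i < weights.length; i++) {
      if (weights[i] === 0) continue; // already chosen
      pick = i;                       // last eligible index doubles as a fallback
      threshold -= weights[i];
      if (threshold < 0) break;
    }
    chosen.push(nodes[pick]);
    weights[pick] = 0; // sample without replacement
  }
  return chosen;
}
```

The caller still has to filter out self-citations, because this sketch does not know which paper is doing the citing.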

Statistical Properties

Degree Distribution

The generated network follows a power law distribution, similar to real citation networks:

P(k) ~ k^(-γ)
where γ ≈ 2.5

This means:
- Few papers have many citations (hubs)
- Most papers have few citations
- Rich-get-richer phenomenon
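
To check the exponent on a generated dataset, one option is to count the citations each paper receives and apply the standard approximate maximum-likelihood estimate for discrete power-law data, γ ≈ 1 + n / Σ ln(k / (k_min − 0.5)). The sketch below assumes the node and edge shapes from the Data Model section. The function names and the choice of `kMin` are illustrative, and this page does not say how the reported γ ≈ 2.5 was actually measured.

```javascript
// Count citations received (in-degree) for every paper ID.
function inDegrees(nodes, edges) {
  const counts = new Map(nodes.map(n => [n.id, 0]));
  for (const e of edges) {
    counts.set(e.target, (counts.get(e.target) ?? 0) + 1);
  }
  return counts;
}

// Approximate MLE of the power-law exponent, using only degrees k >= kMin.
// Returns NaN when no degree reaches kMin.
function estimateGamma(degrees, kMin = 1) {
  const tail = degrees.filter(k => k >= kMin);
  const logSum = tail.reduce((sum, k) => sum + Math.log(k / (kMin - 0.5)), 0);
  return 1 + tail.length / logSum;
}

// Usage, given a dataset object with `nodes` and `edges` arrays:
// const gamma = estimateGamma([...inDegrees(nodes, edges).values()], 5);
```

The estimate is sensitive to `kMin`: too low a cutoff includes the non-power-law head of the distribution, while too high a cutoff leaves few samples.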

Clustering Coefficient

| Metric | Value |
| --- | --- |
| Average Clustering | ~0.15 |
| Transitivity | ~0.08 |
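
The two metrics differ in how they average: average clustering is the mean of each node's local coefficient, while transitivity is the global ratio of closed neighbor pairs to all neighbor pairs. As a rough sketch of how both could be computed, assuming citation direction is ignored (this page does not state whether the reported values treat the graph as directed), and with `clusteringMetrics` as an illustrative name:

```javascript
function clusteringMetrics(edges) {
  // Build an undirected adjacency map, ignoring direction and self-loops.
  const adj = new Map();
  const link = (a, b) => {
    if (!adj.has(a)) adj.set(a, new Set());
    adj.get(a).add(b);
  };
  for (const { source, target } of edges) {
    if (source === target) continue;
    link(source, target);
    link(target, source);
  }

  let localSum = 0;    // sum of local clustering coefficients
  let closedPairs = 0; // neighbor pairs that are themselves linked
  let allPairs = 0;    // all neighbor pairs (connected triples)
  for (const neighbors of adj.values()) {
    const list = [...neighbors];
    const pairs = (list.length * (list.length - 1)) / 2;
    let linked = 0;
    for (let i = 0; i < list.length; i++) {
      for (let j = i + 1; j < list.length; j++) {
        if (adj.get(list[i]).has(list[j])) linked++;
      }
    }
    if (pairs > 0) localSum += linked / pairs;
    closedPairs += linked;
    allPairs += pairs;
  }
  return {
    // Averaged over nodes that appear in at least one edge.
    averageClustering: adj.size ? localSum / adj.size : 0,
    transitivity: allPairs ? closedPairs / allPairs : 0,
  };
}
```

The inner loops are quadratic in a node's degree, so on hub-heavy graphs at the medium or large scale a sampled estimate may be more practical.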

Connected Components

| Metric | Small | Medium | Large |
| --- | --- | --- | --- |
| Largest Component | ~95% | ~98% | ~99% |
| Avg Path Length | ~4.2 | ~4.5 | ~4.8 |
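
The largest-component share can be measured with a union-find pass over the edge list. The sketch below assumes weakly connected components (edge direction ignored) and that every edge references an ID present in `nodes`. The function name is illustrative, and this page does not specify how the figures in the table were computed.

```javascript
// Fraction of nodes in the largest weakly connected component.
function largestComponentFraction(nodes, edges) {
  const parent = new Map(nodes.map(n => [n.id, n.id]));

  const find = id => {
    let root = id;
    while (parent.get(root) !== root) root = parent.get(root);
    // Path compression: point every visited node directly at the root.
    while (parent.get(id) !== root) {
      const next = parent.get(id);
      parent.set(id, root);
      id = next;
    }
    return root;
  };

  // Merge the components of each edge's endpoints.
  for (const { source, target } of edges) {
    parent.set(find(source), find(target));
  }

  const sizes = new Map();
  let largest = 0;
  for (const n of nodes) {
    const root = find(n.id);
    const size = (sizes.get(root) ?? 0) + 1;
    sizes.set(root, size);
    if (size > largest) largest = size;
  }
  return nodes.length ? largest / nodes.length : 0;
}
```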

Scale Variants

Small Dataset

{
  "nodes": 10000,
  "edges": 50000,
  "avg_degree": 5.0,
  "file_size": "~5 MB (JSON)"
}

Use Cases:
- Unit testing
- Quick experiments
- CI/CD pipelines
- Development

Medium Dataset

{
  "nodes": 100000,
  "edges": 1000000,
  "avg_degree": 10.0,
  "file_size": "~80 MB (JSON)"
}

Use Cases:
- Standard benchmarks
- Performance testing
- Comparison evaluation

Large Dataset

{
  "nodes": 1000000,
  "edges": 10000000,
  "avg_degree": 10.0,
  "file_size": "~1.2 GB (JSON)"
}

Use Cases:
- Stress testing
- Production simulation
- Scalability analysis
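
The `avg_degree` values listed for the three variants match edges divided by nodes, that is, the mean number of citations each paper makes. Counting both endpoints of every edge (2 × edges / nodes) would give double these figures. A quick arithmetic check, using only the node and edge counts stated above:

```javascript
// Node and edge counts copied from the three scale variants.
const variants = {
  small: { nodes: 10000, edges: 50000 },
  medium: { nodes: 100000, edges: 1000000 },
  large: { nodes: 1000000, edges: 10000000 },
};

// Mean citations made per paper (out-degree) = edges / nodes.
function avgDegree({ nodes, edges }) {
  return edges / nodes;
}

for (const [name, variant] of Object.entries(variants)) {
  console.log(name, avgDegree(variant)); // small 5, medium 10, large 10
}
```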


Real-World Validation

The synthetic dataset was validated against real citation networks:

| Metric | Synthetic | Cora | PubMed | DBLP |
| --- | --- | --- | --- | --- |
| Nodes | 100K | 2.7K | 19K | 4.6M |
| Avg Degree | 10 | 4.5 | 7.2 | 6.8 |
| Clustering | 0.15 | 0.24 | 0.18 | 0.14 |
| Power Law Exponent | 2.5 | 2.4 | 2.6 | 2.5 |

Data Format

JSON Format

{
  "nodes": [
    {
      "id": "paper_000001",
      "title": "A Novel Approach to Graph Databases",
      "year": 2023,
      "venue": "NeurIPS",
      "authors": ["Jane Doe", "John Smith"]
    }
  ],
  "edges": [
    {
      "source": "paper_000002",
      "target": "paper_000001",
      "label": "CITES"
    }
  ]
}
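
Before running benchmarks against a file in this format, it can help to confirm that it matches the schema. The sketch below is a minimal structural check based only on the fields listed in the Data Model section. `validateDataset` is an illustrative name, not part of the benchmark repository.

```javascript
// Returns a list of human-readable problems; an empty list means the dataset passed.
function validateDataset(dataset) {
  const errors = [];
  const ids = new Set();

  for (const node of dataset.nodes) {
    if (typeof node.id !== 'string') {
      errors.push(`node has non-string id: ${JSON.stringify(node.id)}`);
    } else if (ids.has(node.id)) {
      errors.push(`duplicate node id: ${node.id}`);
    } else {
      ids.add(node.id);
    }
    if (!Number.isInteger(node.year)) errors.push(`${node.id}: year must be an integer`);
    if (!Array.isArray(node.authors)) errors.push(`${node.id}: authors must be an array`);
  }

  for (const edge of dataset.edges) {
    if (!ids.has(edge.source)) errors.push(`unknown source: ${edge.source}`);
    if (!ids.has(edge.target)) errors.push(`unknown target: ${edge.target}`);
    if (edge.label !== 'CITES') errors.push(`unexpected label: ${edge.label}`);
  }
  return errors;
}

// Usage with a file produced by the generator (the path is an example):
// const fs = require('node:fs');
// const errors = validateDataset(JSON.parse(fs.readFileSync('data/small.json', 'utf8')));
```

Run against the abridged sample above, this check would flag `paper_000002` as an unknown source, because the excerpt lists only one node.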

CSV Format

nodes.csv:

id,title,year,venue,authors
paper_000001,"A Novel Approach",2023,NeurIPS,"Jane Doe|John Smith"

edges.csv:

source,target,label
paper_000002,paper_000001,CITES
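
In the CSV form, the `authors` array is flattened into a single pipe-delimited field, and fields containing spaces or commas are quoted. As a sketch of reading one `nodes.csv` row back into the node shape, assuming standard double-quote escaping and no line breaks inside fields (both function names are illustrative):

```javascript
// Split one CSV line into fields, honoring double-quoted fields.
function parseCsvLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // doubled quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Convert a nodes.csv data row into a node object; authors are split on "|".
function parseNodeRow(line) {
  const [id, title, year, venue, authors] = parseCsvLine(line);
  return {
    id,
    title,
    year: Number(year),
    venue,
    authors: authors ? authors.split('|') : [],
  };
}
```

For large files, a streaming CSV parser would be a better fit than line-by-line string handling.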


Generator Code

The dataset generator is included in the benchmark repository:

# Generate small dataset
node src/data/generator.js --size small --output data/small.json

# Generate medium dataset
node src/data/generator.js --size medium --output data/medium.json

# Generate large dataset
node src/data/generator.js --size large --output data/large.json

# Custom parameters
node src/data/generator.js \
  --nodes 50000 \
  --edges 500000 \
  --output data/custom.json
