Generating RDF Data from Shapes#

This tutorial demonstrates how to use the Python bindings for rudof_generate to create synthetic RDF datasets from ShEx and SHACL schemas.

What You’ll Learn#

  • How to configure the data generator

  • How to generate RDF data from schemas

  • How to control randomness and reproducibility

  • How to customize output formats and options

  • How to use configuration files

Prerequisites#

Make sure you have pyrudof installed:

pip install pyrudof

Or install from source:

cd ../python
pip install -e .

Part 1: Basic Setup#

First, let’s import the library and check that everything is working.

# Import the pyrudof library
import pyrudof
import os
import json

print("✓ pyrudof imported successfully!")
print("Available classes: GeneratorConfig, DataGenerator")
print("Available enums: SchemaFormat, OutputFormat, CardinalityStrategy")
✓ pyrudof imported successfully!
Available classes: GeneratorConfig, DataGenerator
Available enums: SchemaFormat, OutputFormat, CardinalityStrategy

Part 2: Simple Example - Generate Data from a ShEx Schema#

Let’s start with a simple example. We’ll use a basic ShEx schema that defines Person and Course entities.

# First, let's create a simple ShEx schema
schema_content = """prefix : <http://example.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

:Person { 
    :name       xsd:string  ;
    :birthdate  xsd:date  ? ;
    :enrolledIn @:Course *
}

:Course { 
    :name xsd:string 
}
"""

# Save the schema to a file
schema_path = "/tmp/tutorial_schema.shex"
with open(schema_path, 'w') as f:
    f.write(schema_content)

print("✓ Schema created:")
print(schema_content)
✓ Schema created:
prefix : <http://example.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

:Person { 
    :name       xsd:string  ;
    :birthdate  xsd:date  ? ;
    :enrolledIn @:Course *
}

:Course { 
    :name xsd:string 
}
# Create a basic configuration
config = pyrudof.GeneratorConfig()

# Set the number of entities to generate
config.set_entity_count(10)

# Set output path
config.set_output_path("/tmp/tutorial_output.ttl")

# Set output format to Turtle (human-readable)
config.set_output_format(pyrudof.OutputFormat.Turtle)

print("✓ Configuration created")
print(f"  Entity count: {config.get_entity_count()}")
print(f"  Output path: {config.get_output_path()}")
✓ Configuration created
  Entity count: 10
  Output path: /tmp/tutorial_output.ttl
# Create the data generator
generator = pyrudof.DataGenerator(config)

# Generate the data!
generator.run(schema_path)

print("✓ Data generation completed!")
✓ Data generation completed!
# Let's view the generated data
with open("/tmp/tutorial_output.ttl", 'r') as f:
    generated_data = f.read()

print("Generated RDF Data:")
print("=" * 70)
print(generated_data[:1000])  # Show first 1000 characters
print("=" * 70)
Generated RDF Data:
======================================================================
<http://example.org/Person-2> <http://example.org/enrolledIn> <http://example.org/Course-2> , <http://example.org/Course-1-0> ;
	a <http://example.org/Person> ;
	<http://example.org/birthdate> "1996-04-07"^^<http://www.w3.org/2001/XMLSchema#date> ;
	<http://example.org/name> "Epsilon513" .
<http://example.org/Course-2> a <http://example.org/Course> ;
	<http://example.org/name> "Gamma523" .
<http://example.org/Course-4> a <http://example.org/Course> ;
	<http://example.org/name> "Beta961" .
<http://example.org/Person-1> a <http://example.org/Person> ;
	<http://example.org/name> "Alpha216" .
<http://example.org/Course-1> a <http://example.org/Course> ;
	<http://example.org/name> "Epsilon365" .
<http://example.org/Person-5> a <http://example.org/Person> ;
	<http://example.org/name> "Epsilon226" .
<http://example.org/Person-4> <http://example.org/enrolledIn> <http://example.org/Course-4> , <http://example.org/Course-3-0> ;
	a <http://example.org/Person> ;
	<http://example.org/birthdate> "20
======================================================================
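
The printed output is truncated, so a quick tally of entity types can help confirm what was generated. The following is a minimal sketch using only the standard library. `count_entities_by_type` is a hypothetical helper, not part of pyrudof, and it assumes the simple Turtle layout shown above, where each type is written as `a <IRI>`. A full RDF parser would be more robust for arbitrary Turtle.

```python
import re
from collections import Counter

# Matches the Turtle shorthand "a <IRI>" used for rdf:type in the output above.
TYPE_PATTERN = re.compile(r"(?:^|\s)a\s+<([^>]+)>")

def count_entities_by_type(turtle_text: str) -> Counter:
    """Count how many subjects are declared with each type IRI."""
    return Counter(TYPE_PATTERN.findall(turtle_text))

# Small excerpt in the same layout as the generated file.
sample = """<http://example.org/Person-1> a <http://example.org/Person> ;
\t<http://example.org/name> "Alpha216" .
<http://example.org/Course-1> a <http://example.org/Course> ;
\t<http://example.org/name> "Epsilon365" .
<http://example.org/Person-5> a <http://example.org/Person> ;
\t<http://example.org/name> "Epsilon226" .
"""

type_counts = count_entities_by_type(sample)
for type_iri, count in sorted(type_counts.items()):
    print(f"{type_iri}: {count}")
```

To apply it to the real output, pass the full contents of /tmp/tutorial_output.ttl instead of the excerpt.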

Part 3: Advanced Configuration Options#

The generator supports many configuration options for fine-tuning data generation.

# Create a fully configured generator
advanced_config = pyrudof.GeneratorConfig()

# Generation settings
advanced_config.set_entity_count(20)
advanced_config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)

# Output settings
advanced_config.set_output_path("/tmp/tutorial_advanced.ttl")
advanced_config.set_output_format(pyrudof.OutputFormat.Turtle)
advanced_config.set_write_stats(True)  # Generate statistics
advanced_config.set_compress(False)

# Parallel processing (for large datasets)
advanced_config.set_worker_threads(4)
advanced_config.set_batch_size(100)

print("✓ Advanced configuration:")
print(f"  Entities: {advanced_config.get_entity_count()}")
print(f"  Worker threads: {advanced_config.get_worker_threads()}")
print("  Statistics enabled: Yes")
✓ Advanced configuration:
  Entities: 20
  Worker threads: 4
  Statistics enabled: Yes
# Generate with advanced settings
advanced_generator = pyrudof.DataGenerator(advanced_config)
advanced_generator.run(schema_path)

print("✓ Data generated with advanced settings")
✓ Data generated with advanced settings
# View the statistics file
stats_path = "/tmp/tutorial_advanced.stats.json"
if os.path.exists(stats_path):
    with open(stats_path, 'r') as f:
        stats = json.load(f)
    
    print("📊 Generation Statistics:")
    print("=" * 70)
    print(json.dumps(stats, indent=2))
    print("=" * 70)
else:
    print("No statistics file found")
📊 Generation Statistics:
======================================================================
{
  "total_triples": 65,
  "total_subjects": 25,
  "total_predicates": 4,
  "total_objects": 42,
  "generation_time": "0ms",
  "shape_counts": {
    "http://example.org/Course": 15,
    "http://example.org/Person": 10
  }
}
======================================================================
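
The statistics file is plain JSON, so derived figures are easy to compute. The sketch below works on a dictionary with the same keys as the output above. `summarize_stats` is a hypothetical helper for illustration, not part of pyrudof.

```python
def summarize_stats(stats: dict) -> dict:
    """Derive a few ratios from a generator statistics dictionary."""
    subjects = stats.get("total_subjects", 0)
    triples = stats.get("total_triples", 0)
    shape_counts = stats.get("shape_counts", {})
    total_shapes = sum(shape_counts.values())

    # Share of generated entities per shape (empty if nothing was counted).
    shape_share = {}
    if total_shapes:
        shape_share = {
            shape: count / total_shapes for shape, count in shape_counts.items()
        }

    return {
        "triples_per_subject": triples / subjects if subjects else 0.0,
        "shape_share": shape_share,
    }

# Values copied from the statistics output shown above.
stats = {
    "total_triples": 65,
    "total_subjects": 25,
    "shape_counts": {
        "http://example.org/Course": 15,
        "http://example.org/Person": 10,
    },
}

summary = summarize_stats(stats)
print(f"Triples per subject: {summary['triples_per_subject']:.2f}")
for shape, share in summary["shape_share"].items():
    print(f"{shape}: {share:.0%}")
```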

Part 4: Cardinality Strategies#

Different strategies control how many related entities are created.

# Test different cardinality strategies
strategies = [
    ("Minimum", pyrudof.CardinalityStrategy.Minimum),
    ("Maximum", pyrudof.CardinalityStrategy.Maximum),
    ("Random", pyrudof.CardinalityStrategy.Random),
    ("Balanced", pyrudof.CardinalityStrategy.Balanced),
]

results = {}

for name, strategy in strategies:
    config = pyrudof.GeneratorConfig()
    config.set_entity_count(5)
    config.set_cardinality_strategy(strategy)
    config.set_output_path(f"/tmp/tutorial_{name.lower()}.ttl")
    config.set_output_format(pyrudof.OutputFormat.Turtle)
    config.set_write_stats(True)
    
    generator = pyrudof.DataGenerator(config)
    generator.run(schema_path)
    
    # Read statistics
    stats_path = f"/tmp/tutorial_{name.lower()}.stats.json"
    if os.path.exists(stats_path):
        with open(stats_path, 'r') as f:
            stats = json.load(f)
        results[name] = stats.get('total_triples', 0)

print("Cardinality Strategy Comparison:")
print("=" * 70)
for name, triple_count in results.items():
    print(f"  {name:12s}: {triple_count} triples generated")
print("=" * 70)
print("\nNote: Different strategies affect how many relationships are created.")
Cardinality Strategy Comparison:
======================================================================
  Minimum     : 10 triples generated
  Maximum     : 25 triples generated
  Random      : 15 triples generated
  Balanced    : 15 triples generated
======================================================================

Note: Different strategies affect how many relationships are created.
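
To make the idea concrete, here is an illustrative sketch of how a strategy could turn a minimum/maximum range into a count. This is not pyrudof's implementation: the function name `pick_count`, the midpoint rule for Balanced, and the uniform draw for Random are all assumptions made for illustration.

```python
import random
from typing import Optional

def pick_count(strategy: str, min_count: int, max_count: int,
               rng: Optional[random.Random] = None) -> int:
    """Illustrative only: choose how many values to emit for one property."""
    if min_count > max_count:
        raise ValueError("min_count must not exceed max_count")
    if strategy == "Minimum":
        return min_count
    if strategy == "Maximum":
        return max_count
    if strategy == "Balanced":
        # Assumption: "balanced" is modelled here as the midpoint of the range.
        return (min_count + max_count) // 2
    if strategy == "Random":
        # Assumption: a uniform draw over the inclusive range.
        rng = rng or random.Random()
        return rng.randint(min_count, max_count)
    raise ValueError(f"unknown strategy: {strategy}")

rng = random.Random(42)  # fixed seed so the Random draw is repeatable
for name in ("Minimum", "Maximum", "Balanced", "Random"):
    print(f"{name:9s} -> {pick_count(name, 2, 5, rng)}")
```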

Custom Cardinality Configuration#

You can also specify custom cardinality ranges for specific properties in your schema. This gives you fine-grained control over how many related entities are generated for each relationship.

# Create a schema with specific cardinality constraints (using ShEx format)
custom_schema_content = """
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

ex:User {
    ex:name xsd:string ;
    ex:email xsd:string ;
    ex:hasFriend @ex:User {2,5} ;   # Each user must have 2-5 friends
    ex:hasPost @ex:Post {0,10}       # Each user can have 0-10 posts
}

ex:Post {
    ex:title xsd:string ;
    ex:content xsd:string ;
    ex:hasLike @ex:User {0,50}       # Each post can have 0-50 likes
}
"""

# Save the custom schema (using .shex extension for ShEx format)
custom_schema_path = "/tmp/custom_cardinality_schema.shex"
with open(custom_schema_path, 'w') as f:
    f.write(custom_schema_content)

print("✓ Custom cardinality schema created")
print("\nKey cardinality constraints:")
print("  • Each User has 2-5 friends (hasFriend)")
print("  • Each User has 0-10 posts (hasPost)")
print("  • Each Post has 0-50 likes (hasLike)")
✓ Custom cardinality schema created

Key cardinality constraints:
  • Each User has 2-5 friends (hasFriend)
  • Each User has 0-10 posts (hasPost)
  • Each Post has 0-50 likes (hasLike)
# Generate data with different strategies to see how they respect cardinality constraints
custom_strategies = [
    ("Minimum", pyrudof.CardinalityStrategy.Minimum),
    ("Maximum", pyrudof.CardinalityStrategy.Maximum),
    ("Balanced", pyrudof.CardinalityStrategy.Balanced),
]

print("Generating data with custom cardinality constraints:")
print("=" * 80)

for name, strategy in custom_strategies:
    config = pyrudof.GeneratorConfig()
    config.set_entity_count(5)
    config.set_cardinality_strategy(strategy)
    config.set_output_path(f"/tmp/custom_cardinality_{name.lower()}.ttl")
    config.set_output_format(pyrudof.OutputFormat.Turtle)
    config.set_write_stats(True)
    
    generator = pyrudof.DataGenerator(config)
    generator.run(custom_schema_path)
    
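    # Read the generated data and count predicate occurrences.
    # Note: Turtle groups objects that share a subject and predicate with commas,
    # so these counts reflect how many subjects use each predicate, not the
    # number of individual links.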
    with open(f"/tmp/custom_cardinality_{name.lower()}.ttl", 'r') as f:
        data = f.read()
    
    hasFriend_count = data.count('hasFriend')
    hasPost_count = data.count('hasPost')
    hasLike_count = data.count('hasLike')
    
    print(f"\n{name} Strategy:")
    print(f"  hasFriend relationships: {hasFriend_count}")
    print(f"  hasPost relationships: {hasPost_count}")
    print(f"  hasLike relationships: {hasLike_count}")

print("\n" + "=" * 80)
print("\nObservations:")
print("  • Minimum strategy: Uses minCount values (2 friends, 0 posts, 0 likes)")
print("  • Maximum strategy: Uses maxCount values (5 friends, 10 posts, 50 likes)")
print("  • Balanced strategy: Picks values between min and max")
Generating data with custom cardinality constraints:
================================================================================

Minimum Strategy:
  hasFriend relationships: 2
  hasPost relationships: 0
  hasLike relationships: 0

Maximum Strategy:
  hasFriend relationships: 3
  hasPost relationships: 3
  hasLike relationships: 2

Balanced Strategy:
  hasFriend relationships: 3
  hasPost relationships: 2
  hasLike relationships: 1

================================================================================

Observations:
  • Minimum strategy: Uses minCount values (2 friends, 0 posts, 0 likes)
  • Maximum strategy: Uses maxCount values (5 friends, 10 posts, 50 likes)
  • Balanced strategy: Picks values between min and max
# Let's examine a sample of the generated data with Balanced strategy
print("Sample of generated data with Balanced strategy:")
print("=" * 80)

with open("/tmp/custom_cardinality_balanced.ttl", 'r') as f:
    content = f.read()
    # Show first 800 characters to see the structure
    print(content[:800])
    print("...")

print("=" * 80)
print("\nYou can see how the generator respects the cardinality constraints")
print("defined in the schema while following the chosen strategy.")
Sample of generated data with Balanced strategy:
================================================================================
<http://example.org/User-1-0> a <http://example.org/User> ;
	<http://example.org/name> "Delta850" , "Delta142" ;
	<http://example.org/email> "test38@sample.edu" , "admin64@test.org" .
<http://example.org/User-2-0> a <http://example.org/User> ;
	<http://example.org/name> "Beta817" ;
	<http://example.org/email> "admin79@test.org" .
<http://example.org/User-3> <http://example.org/hasPost> <http://example.org/Post-1> , <http://example.org/Post-2-1> , <http://example.org/Post-2> , <http://example.org/Post-2-0> ;
	<http://example.org/hasFriend> <http://example.org/User-2-0> , <http://example.org/User-3> , <http://example.org/User-2-1> , <http://example.org/User-2-2> , <http://example.org/User-1> , <http://example.org/User-2-3> ;
	a <http://example.org/User> ;
	<http://example.org/name> "Alpha913
...
================================================================================

You can see how the generator respects the cardinality constraints
defined in the schema while following the chosen strategy.
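
Because Turtle groups objects that share a subject and predicate into one comma-separated list, counting occurrences of a predicate name (as the comparison loop above does) counts subjects rather than individual links. The sketch below counts the objects instead. `count_links` is a hypothetical helper, not part of pyrudof, and it assumes IRI objects written in full as in the sample output. A real RDF parser would be the robust choice.

```python
import re

def count_links(turtle_text: str, predicate_iri: str) -> int:
    """Count IRI objects attached to predicate_iri across all subjects."""
    # One predicate followed by a comma-separated list of <IRI> objects.
    pattern = re.compile(
        r"<" + re.escape(predicate_iri) + r">\s+((?:<[^>]*>\s*,\s*)*<[^>]*>)"
    )
    return sum(match.count("<") for match in pattern.findall(turtle_text))

# Excerpt adapted from the Balanced sample output above.
sample = (
    "<http://example.org/User-3> <http://example.org/hasPost> "
    "<http://example.org/Post-1> , <http://example.org/Post-2-1> , "
    "<http://example.org/Post-2> , <http://example.org/Post-2-0> ;\n"
    "\t<http://example.org/hasFriend> <http://example.org/User-2-0> , "
    "<http://example.org/User-1> ;\n"
    "\ta <http://example.org/User> .\n"
)

post_links = count_links(sample, "http://example.org/hasPost")
friend_links = count_links(sample, "http://example.org/hasFriend")
print("hasPost links:", post_links)
print("hasFriend links:", friend_links)
print("hasPost occurrences:", sample.count("hasPost"))
```

In this excerpt the predicate name appears once per subject, while the link count reflects every object in the list.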

Part 5: Working with Configuration Files#

For complex configurations, you can save and load settings from TOML files.

# Create a configuration
config = pyrudof.GeneratorConfig()
config.set_entity_count(100)
config.set_output_path("/tmp/from_config.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
config.set_write_stats(True)
config.set_worker_threads(4)

# Save to TOML file
config_file = "/tmp/generator_config.toml"
config.to_toml_file(config_file)

print("✓ Configuration saved to TOML file")
print("\nConfiguration file contents:")
print("=" * 70)
with open(config_file, 'r') as f:
    print(f.read())
print("=" * 70)
✓ Configuration saved to TOML file

Configuration file contents:
======================================================================
[generation]
entity_count = 100
entity_distribution = "Equal"
cardinality_strategy = "Balanced"

[field_generators.default]
locale = "en"
quality = "Medium"

[field_generators.datatypes]

[field_generators.properties]

[output]
path = "/tmp/from_config.ttl"
format = "Turtle"
compress = false
write_stats = true
parallel_writing = false
parallel_file_count = 0

[parallel]
worker_threads = 4
batch_size = 100
parallel_shapes = true
parallel_fields = true

======================================================================
# Load configuration from file
loaded_config = pyrudof.GeneratorConfig.from_toml_file(config_file)

print("✓ Configuration loaded from file")
print(f"  Entity count: {loaded_config.get_entity_count()}")
print(f"  Output path: {loaded_config.get_output_path()}")

# You can override settings after loading
loaded_config.set_entity_count(50)
loaded_config.set_output_path("/tmp/from_config_modified.ttl")

print("\n✓ Configuration modified")
print(f"  New entity count: {loaded_config.get_entity_count()}")
✓ Configuration loaded from file
  Entity count: 100
  Output path: /tmp/from_config.ttl

✓ Configuration modified
  New entity count: 50
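
Since the saved file is plain TOML, it can also be written by hand or from a template. The sketch below mirrors the layout printed above. `write_generator_toml` is a hypothetical helper, and whether `from_toml_file` accepts a hand-written file with exactly these keys is an assumption based on that output.

```python
import tempfile
from pathlib import Path

# Layout copied from the configuration file contents shown above.
TOML_TEMPLATE = """[generation]
entity_count = {entity_count}
entity_distribution = "Equal"
cardinality_strategy = "{strategy}"

[field_generators.default]
locale = "en"
quality = "Medium"

[field_generators.datatypes]

[field_generators.properties]

[output]
path = "{output_path}"
format = "Turtle"
compress = false
write_stats = true
parallel_writing = false
parallel_file_count = 0

[parallel]
worker_threads = 4
batch_size = 100
parallel_shapes = true
parallel_fields = true
"""

def write_generator_toml(config_path: Path, entity_count: int,
                         output_path: str, strategy: str = "Balanced") -> Path:
    """Write a generator configuration using the layout shown above."""
    config_path.write_text(
        TOML_TEMPLATE.format(entity_count=entity_count,
                             output_path=output_path,
                             strategy=strategy),
        encoding="utf-8",
    )
    return config_path

config_path = write_generator_toml(
    Path(tempfile.gettempdir()) / "handwritten_config.toml",
    entity_count=250,
    output_path="/tmp/handwritten_output.ttl",
)
toml_text = config_path.read_text(encoding="utf-8")
print(toml_text)
```

The resulting path could then be passed to `pyrudof.GeneratorConfig.from_toml_file`, as in the loading cell above.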

Part 7: Different Output Formats#

The generator supports multiple RDF output formats.

# Generate in Turtle format
config_turtle = pyrudof.GeneratorConfig()
config_turtle.set_entity_count(3)
config_turtle.set_output_path("/tmp/tutorial_output.ttl")
config_turtle.set_output_format(pyrudof.OutputFormat.Turtle)

generator_turtle = pyrudof.DataGenerator(config_turtle)
generator_turtle.run(schema_path)

print("Turtle Format:")
print("=" * 70)
with open("/tmp/tutorial_output.ttl", 'r') as f:
    print(f.read()[:500])
print("...")
print("=" * 70)
Turtle Format:
======================================================================
<http://example.org/Person-1> <http://example.org/name> "Delta274" ;
	a <http://example.org/Person> .
<http://example.org/Person-2> <http://example.org/enrolledIn> <http://example.org/Course-1-0> , <http://example.org/Course-1> ;
	<http://example.org/name> "Epsilon282" ;
	<http://example.org/birthdate> "1953-05-09"^^<http://www.w3.org/2001/XMLSchema#date> ;
	a <http://example.org/Person> .
<http://example.org/Course-1-0> <http://example.org/name> "Gamma102" ;
	a <http://example.org/Course> .
<ht
...
======================================================================
# Generate in NTriples format
config_ntriples = pyrudof.GeneratorConfig()
config_ntriples.set_entity_count(3)
config_ntriples.set_output_path("/tmp/tutorial_output.nt")
config_ntriples.set_output_format(pyrudof.OutputFormat.NTriples)

generator_ntriples = pyrudof.DataGenerator(config_ntriples)
generator_ntriples.run(schema_path)

print("NTriples Format:")
print("=" * 70)
with open("/tmp/tutorial_output.nt", 'r') as f:
    print(f.read()[:500])
print("...")
print("=" * 70)
NTriples Format:
======================================================================
<http://example.org/Course-1> <http://example.org/name> "Delta880" .
<http://example.org/Course-1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Course> .
<http://example.org/Person-2> <http://example.org/birthdate> "2010-08-07"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://example.org/Person-2> <http://example.org/name> "Delta702" .
<http://example.org/Person-2> <http://example.org/enrolledIn> <http://example.org/Course-1> .
<http://example.org/Person-2> <http://exa
...
======================================================================
# Create a simple SHACL schema
shacl_schema = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:name ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path ex:age ;
        sh:datatype xsd:integer ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
"""

# Save SHACL schema
shacl_path = "/tmp/tutorial_schema.shacl.ttl"
with open(shacl_path, 'w') as f:
    f.write(shacl_schema)

print("✓ SHACL schema created")

# Generate data from SHACL schema
shacl_config = pyrudof.GeneratorConfig()
shacl_config.set_entity_count(10)
shacl_config.set_output_path("/tmp/tutorial_shacl_output.ttl")
shacl_config.set_output_format(pyrudof.OutputFormat.Turtle)

shacl_generator = pyrudof.DataGenerator(shacl_config)

# Explicitly load as SHACL
shacl_generator.load_shacl_schema(shacl_path)
shacl_generator.generate()

print("✓ Data generated from SHACL schema")

# View the generated data
with open("/tmp/tutorial_shacl_output.ttl", 'r') as f:
    shacl_output = f.read()

print("\nGenerated data from SHACL:")
print("=" * 70)
print(shacl_output[:500])
print("...")
print("=" * 70)
✓ SHACL schema created
✓ Data generated from SHACL schema

Generated data from SHACL:
======================================================================
<http://example.org/PersonShape-3> a <http://example.org/PersonShape> ;
	<http://example.org/name> "Beta650" ;
	<http://example.org/age> 6884 .
<http://example.org/PersonShape-5> a <http://example.org/PersonShape> ;
	<http://example.org/name> "Gamma525" ;
	<http://example.org/age> 8054 .
<http://example.org/PersonShape-8> a <http://example.org/PersonShape> ;
	<http://example.org/name> "Alpha384" ;
	<http://example.org/age> 4237 .
<http://example.org/PersonShape-7> a <http://example.org/PersonSha
...
======================================================================

Working with SHACL Schemas#

The generator also supports SHACL schemas in addition to ShEx.

# Create a fully configured generator and inspect it
inspect_config = pyrudof.GeneratorConfig()
inspect_config.set_entity_count(50)
inspect_config.set_output_path("/tmp/inspect_output.ttl")
inspect_config.set_output_format(pyrudof.OutputFormat.Turtle)
inspect_config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
inspect_config.set_worker_threads(4)
inspect_config.set_batch_size(100)
inspect_config.set_write_stats(True)
inspect_config.set_compress(False)

print("Configuration Overview:")
print("=" * 80)
print(inspect_config.show())
print("=" * 80)

print("\nYou can use show() to debug configuration issues or document your setup.")
Configuration Overview:
================================================================================
GeneratorConfig { generation: GenerationConfig { entity_count: 50, seed: None, entity_distribution: Equal, cardinality_strategy: Balanced, schema_format: None }, field_generators: FieldGeneratorConfig { default: DefaultFieldConfig { locale: "en", quality: Medium }, datatypes: {}, properties: {} }, output: OutputConfig { path: "/tmp/inspect_output.ttl", format: Turtle, compress: false, write_stats: true, parallel_writing: false, parallel_file_count: 0 }, parallel: ParallelConfig { worker_threads: Some(4), batch_size: 100, parallel_shapes: true, parallel_fields: true } }
================================================================================

You can use show() to debug configuration issues or document your setup.

Inspecting Configuration#

You can view the current configuration settings using the show() method.

# Method 1: Explicit ShEx loading
explicit_config = pyrudof.GeneratorConfig()
explicit_config.set_entity_count(5)
explicit_config.set_output_path("/tmp/tutorial_explicit_shex.ttl")
explicit_config.set_output_format(pyrudof.OutputFormat.Turtle)

generator_explicit = pyrudof.DataGenerator(explicit_config)

# Explicitly load ShEx schema
generator_explicit.load_shex_schema(schema_path)

# Then generate data
generator_explicit.generate()

print("✓ Method 1: Explicit ShEx loading and generation completed")

# Method 2: Using run_with_format for explicit format specification
format_config = pyrudof.GeneratorConfig()
format_config.set_entity_count(5)
format_config.set_output_path("/tmp/tutorial_with_format.ttl")
format_config.set_output_format(pyrudof.OutputFormat.Turtle)

generator_format = pyrudof.DataGenerator(format_config)
generator_format.run_with_format(schema_path, pyrudof.SchemaFormat.ShEx)

print("✓ Method 2: run_with_format completed")

# Method 3: Auto-detect (this is what run() does internally)
auto_config = pyrudof.GeneratorConfig()
auto_config.set_entity_count(5)
auto_config.set_output_path("/tmp/tutorial_auto.ttl")
auto_config.set_output_format(pyrudof.OutputFormat.Turtle)

generator_auto = pyrudof.DataGenerator(auto_config)
generator_auto.load_schema_auto(schema_path)
generator_auto.generate()

print("✓ Method 3: Auto-detect and generation completed")

print("\nAll three methods produce equivalent results!")
✓ Method 1: Explicit ShEx loading and generation completed
✓ Method 2: run_with_format completed
✓ Method 3: Auto-detect and generation completed

All three methods produce equivalent results!
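
When a pipeline handles both schema languages, the choice between `load_shex_schema` and `load_shacl_schema` has to be made somewhere. The sketch below picks a format name from the file extension. It is a hypothetical helper, not a description of how `load_schema_auto` detects formats, and the extension sets are assumptions based on the file names used in this tutorial.

```python
from pathlib import Path

# Assumed extension conventions; adjust to match your own files.
SHEX_SUFFIXES = {".shex"}
SHACL_SUFFIXES = {".ttl", ".shacl"}

def guess_schema_format(schema_path: str) -> str:
    """Return "ShEx" or "SHACL" based on the final file extension."""
    suffix = Path(schema_path).suffix.lower()
    if suffix in SHEX_SUFFIXES:
        return "ShEx"
    if suffix in SHACL_SUFFIXES:
        return "SHACL"
    raise ValueError(f"cannot guess schema format for {schema_path!r}")

for candidate in ("/tmp/tutorial_schema.shex", "/tmp/tutorial_schema.shacl.ttl"):
    print(candidate, "->", guess_schema_format(candidate))
```

The returned name can then decide which explicit loader to call before `generate()`.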

Explicit Schema Loading Methods#

Instead of using the auto-detect run() method, you can explicitly load and generate in separate steps.

# Create a configuration and save it as JSON
json_config = pyrudof.GeneratorConfig()
json_config.set_entity_count(25)
json_config.set_output_path("/tmp/from_json.ttl")
json_config.set_output_format(pyrudof.OutputFormat.Turtle)
json_config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Random)
json_config.set_worker_threads(2)

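# Note: this example writes the JSON file by hand from a few getter values
# instead of relying on a JSON export method.
try:
    json_config_file = "/tmp/generator_config.json"
    # Build a flat JSON object. It does not follow the nested sections
    # ([generation], [output], [parallel]) used by the TOML file, which is why
    # loading it fails below. The json module is already imported above.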
    
    config_data = {
        "entity_count": json_config.get_entity_count(),
        "output_path": json_config.get_output_path(),
        "worker_threads": json_config.get_worker_threads()
    }
    
    with open(json_config_file, 'w') as f:
        json.dump(config_data, f, indent=2)
    
    print("✓ JSON configuration file created:")
    print("=" * 70)
    with open(json_config_file, 'r') as f:
        print(f.read())
    print("=" * 70)
    
    # Try to load it back
    try:
        loaded_json_config = pyrudof.GeneratorConfig.from_json_file(json_config_file)
        print("\n✓ Configuration loaded from JSON")
        print(f"  Entity count: {loaded_json_config.get_entity_count()}")
    except Exception as e:
        print(f"\n⚠️  JSON loading not yet fully implemented: {e}")
        print("   (This is a known limitation)")
        
except Exception as e:
    print(f"⚠️  JSON configuration: {e}")
✓ JSON configuration file created:
======================================================================
{
  "entity_count": 25,
  "output_path": "/tmp/from_json.ttl",
  "worker_threads": 2
}
======================================================================

⚠️  JSON loading not yet fully implemented: JSON parsing error: missing field `generation` at line 5 column 1
   (This is a known limitation)
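
The error message above reports a missing `generation` field, which suggests the JSON loader expects the same nested sections as the TOML file rather than a flat object. The sketch below builds such a nested document with the standard library. The key names are taken from the TOML output earlier in this tutorial; whether `from_json_file` accepts exactly this structure is an assumption and has not been verified here.

```python
import json
import tempfile
from pathlib import Path

def build_nested_config(entity_count: int, output_path: str,
                        worker_threads: int) -> dict:
    """Mirror the generation/output/parallel sections of the TOML file."""
    return {
        "generation": {
            "entity_count": entity_count,
            "entity_distribution": "Equal",
            "cardinality_strategy": "Random",
        },
        "field_generators": {
            "default": {"locale": "en", "quality": "Medium"},
            "datatypes": {},
            "properties": {},
        },
        "output": {
            "path": output_path,
            "format": "Turtle",
            "compress": False,
            "write_stats": False,
            "parallel_writing": False,
            "parallel_file_count": 0,
        },
        "parallel": {
            "worker_threads": worker_threads,
            "batch_size": 100,
            "parallel_shapes": True,
            "parallel_fields": True,
        },
    }

nested = build_nested_config(25, "/tmp/from_json.ttl", 2)
json_path = Path(tempfile.gettempdir()) / "generator_config_nested.json"
json_path.write_text(json.dumps(nested, indent=2), encoding="utf-8")
print(json_path.read_text(encoding="utf-8"))
```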

Working with JSON Configuration Files#

Besides TOML, you can also use JSON configuration files.

# Configure parallel writing (useful for very large datasets)
parallel_config = pyrudof.GeneratorConfig()
parallel_config.set_entity_count(20)
parallel_config.set_output_path("/tmp/tutorial_parallel.ttl")
parallel_config.set_output_format(pyrudof.OutputFormat.Turtle)

# Enable parallel writing and split across 3 files
parallel_config.set_parallel_writing(True)
parallel_config.set_parallel_file_count(3)

print("✓ Parallel writing configuration:")
print(f"  Entity count: {parallel_config.get_entity_count()}")
print("  Parallel writing: Enabled")
print("  Output files: 3")

# Generate data
parallel_generator = pyrudof.DataGenerator(parallel_config)
parallel_generator.run(schema_path)

print("\n✓ Data generated with parallel writing")
print("\nGenerated files:")
import glob
parallel_files = glob.glob("/tmp/tutorial_parallel*.ttl")
for f in parallel_files:
    size = os.path.getsize(f)
    print(f"  {os.path.basename(f)}: {size} bytes")
✓ Parallel writing configuration:
  Entity count: 20
  Parallel writing: Enabled
  Output files: 3

✓ Data generated with parallel writing

Generated files:
  tutorial_parallel_part_001.ttl: 1199 bytes
  tutorial_parallel_part_003.ttl: 1104 bytes
  tutorial_parallel_part_002.ttl: 1224 bytes
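
The listing above comes back out of order because `glob` does not sort its results. A small sketch for collecting the part files in order and totalling their size follows. `list_part_files` is a hypothetical helper, and the `_part_NNN` naming is taken from the output above.

```python
import tempfile
from pathlib import Path

def list_part_files(directory: Path, stem: str, suffix: str = ".ttl") -> list:
    """Return part files such as <stem>_part_001.ttl in sorted order."""
    return sorted(directory.glob(f"{stem}_part_*{suffix}"))

# Demonstrate with placeholder files in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    scratch_dir = Path(scratch)
    for index in (3, 1, 2):
        part = scratch_dir / f"tutorial_parallel_part_{index:03d}.ttl"
        part.write_text("# placeholder\n", encoding="utf-8")

    parts = list_part_files(scratch_dir, "tutorial_parallel")
    part_names = [p.name for p in parts]
    total_bytes = sum(p.stat().st_size for p in parts)

print(part_names)
print(f"Total size: {total_bytes} bytes")
```

For the files generated in this tutorial, the call would be `list_part_files(Path("/tmp"), "tutorial_parallel")`.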

Entity Distribution#

Control the distribution of entity types in your generated dataset. This is useful when you need specific proportions of different entities for realistic testing scenarios.
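
As a quick worked example of what such proportions mean, the sketch below converts shares into expected entity counts. `expected_counts` is a hypothetical helper for illustration and is independent of how pyrudof represents distributions.

```python
def expected_counts(total: int, shares: dict) -> dict:
    """Convert fractional shares into rounded entity counts."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")
    # Rounding each share separately may not add up to total in every case.
    return {name: round(total * share) for name, share in shares.items()}

counts = expected_counts(30, {
    "http://example.org/Person": 0.7,
    "http://example.org/Course": 0.3,
})
print(counts)
```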

# Configure entity distribution
# Control how many entities of each type are generated
distribution_config = pyrudof.GeneratorConfig()
distribution_config.set_entity_count(30)
distribution_config.set_output_path("/tmp/tutorial_distribution.ttl")
distribution_config.set_output_format(pyrudof.OutputFormat.Turtle)
distribution_config.set_write_stats(True)

# Set specific distribution: 70% Persons, 30% Courses
# This gives more control over the shape of your generated data
distribution_config.set_entity_distribution({
    "http://example.org/Person": 0.7,
    "http://example.org/Course": 0.3
})

print("✓ Entity distribution configuration:")
print(f"  Total entities: {distribution_config.get_entity_count()}")
print(f"  Distribution: 70% Person, 30% Course")
print(f"  Expected: ~21 Persons, ~9 Courses")

# Generate data with custom distribution
distribution_generator = pyrudof.DataGenerator(distribution_config)
distribution_generator.run(schema_path)

print("\n✓ Data generated with custom entity distribution")

# Verify the distribution in the generated data
with open("/tmp/tutorial_distribution.ttl", 'r') as f:
    distribution_output = f.read()

person_count = distribution_output.count('a <http://example.org/Person>')
course_count = distribution_output.count('a <http://example.org/Course>')

print(f"\nActual distribution:")
print(f"  Persons: {person_count}")
print(f"  Courses: {course_count}")
print(f"  Ratio: {person_count/(person_count+course_count):.1%} / {course_count/(person_count+course_count):.1%}")

# View statistics
stats_path = "/tmp/tutorial_distribution.stats.json"
if os.path.exists(stats_path):
    with open(stats_path, 'r') as f:
        stats = json.load(f)
    print(f"\n📊 Total triples generated: {stats.get('total_triples', 0)}")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 11
      7 distribution_config.set_write_stats(True)
      9 # Set specific distribution: 70% Persons, 30% Courses
     10 # This gives more control over the shape of your generated data
---> 11 distribution_config.set_entity_distribution({
     12     "http://example.org/Person": 0.7,
     13     "http://example.org/Course": 0.3
     14 })
     16 print("✓ Entity distribution configuration:")
     17 print(f"  Total entities: {distribution_config.get_entity_count()}")

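TypeError: argument 'distribution': 'dict' object cannot be converted to 'EntityDistribution'

The call fails because set_entity_distribution() expects an EntityDistribution value rather than a Python dict, so the rest of this cell never runs. The saved TOML configuration earlier in this tutorial shows "Equal" as one such value.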

Parallel Writing for Large Datasets#

When generating very large datasets, you can split the output across multiple files for better performance.

Part 6: Advanced Features#

Let’s explore some additional advanced features that haven’t been covered yet.

Summary#

In this tutorial, you learned:

Core Features#

  1. Basic Setup - Import and initialize the pyrudof library

  2. Simple Data Generation - Generate RDF data from ShEx schemas

  3. Advanced Configuration - Fine-tune generation with multiple options

  4. Cardinality Strategies - Control relationship density (Minimum, Maximum, Random, Balanced)

  5. Configuration Files - Save and load settings from TOML files

Advanced Features#

  1. Output Formats - Generate data in Turtle, NTriples, and other RDF formats

  2. SHACL Schema Support - Generate data from SHACL schemas

  3. Configuration Inspection - View and debug configuration settings with show()

  4. Explicit Schema Loading - Load schemas explicitly (ShEx/SHACL) or auto-detect

  5. JSON Configuration - Work with JSON configuration files

  6. Parallel Writing - Split large datasets across multiple files

  7. Entity Distribution - Control proportions of different entity types

Next Steps#

  • Try with your own ShEx or SHACL schemas

  • Generate larger datasets for testing

  • Experiment with parallel processing and parallel writing settings

  • Integrate into your data pipelines

  • Validate generated data against schemas

  • Test different cardinality strategies for your use case

Resources#

# Clean up temporary files (optional)
import glob

temp_files = glob.glob("/tmp/tutorial_*")
print(f"Generated {len(temp_files)} temporary files during this tutorial")
print("\nTo clean up, you can remove files in /tmp/tutorial_*")
Generated 18 temporary files during this tutorial

To clean up, you can remove files in /tmp/tutorial_*
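
The cell above only counts the files. To actually delete them, a sketch along these lines could be used. `remove_matching` is a hypothetical helper, and the demonstration runs against a scratch directory so that nothing in /tmp is touched unless you change the arguments.

```python
import tempfile
from pathlib import Path

def remove_matching(directory: Path, pattern: str) -> int:
    """Delete regular files in directory matching pattern; return how many."""
    removed = 0
    for path in directory.glob(pattern):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed

# Demonstrate on placeholder files in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    scratch_dir = Path(scratch)
    (scratch_dir / "tutorial_output.ttl").write_text("", encoding="utf-8")
    (scratch_dir / "tutorial_output.nt").write_text("", encoding="utf-8")
    (scratch_dir / "keep_me.txt").write_text("", encoding="utf-8")

    removed_count = remove_matching(scratch_dir, "tutorial_*")
    remaining = sorted(p.name for p in scratch_dir.iterdir())

print(f"Removed {removed_count} files; remaining: {remaining}")
```

For the files from this tutorial, the equivalent call would be `remove_matching(Path("/tmp"), "tutorial_*")`.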