Generating RDF Data from Shapes#
This tutorial demonstrates how to use the Python bindings for rudof_generate to create synthetic RDF datasets from ShEx and SHACL schemas.
What You’ll Learn#
How to configure the data generator
How to generate RDF data from schemas
How to control randomness and reproducibility
How to customize output formats and options
How to use configuration files
Prerequisites#
Make sure you have pyrudof installed:
pip install pyrudof
Or install from source:
cd ../python
pip install -e .
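Before running the cells below, it can help to confirm that the package is importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration and is not part of pyrudof.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pyrudof"):
    print("pyrudof is available")
else:
    print("pyrudof not found; run `pip install pyrudof` first")
```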
Part 1: Basic Setup#
First, let’s import the library and check that everything is working.
# Import the pyrudof library
import pyrudof
import os
import json
print("✓ pyrudof imported successfully!")
print(f"Available classes: GeneratorConfig, DataGenerator")
print(f"Available enums: SchemaFormat, OutputFormat, CardinalityStrategy")
✓ pyrudof imported successfully!
Available classes: GeneratorConfig, DataGenerator
Available enums: SchemaFormat, OutputFormat, CardinalityStrategy
Part 2: Simple Example - Generate Data from a ShEx Schema#
Let’s start with a simple example. We’ll use a basic ShEx schema that defines Person and Course entities.
# First, let's create a simple ShEx schema
schema_content = """prefix : <http://example.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
:Person {
:name xsd:string ;
:birthdate xsd:date ? ;
:enrolledIn @:Course *
}
:Course {
:name xsd:string
}
"""
# Save the schema to a file
schema_path = "/tmp/tutorial_schema.shex"
with open(schema_path, 'w') as f:
    f.write(schema_content)
print("✓ Schema created:")
print(schema_content)
✓ Schema created:
prefix : <http://example.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
:Person {
:name xsd:string ;
:birthdate xsd:date ? ;
:enrolledIn @:Course *
}
:Course {
:name xsd:string
}
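The cells in this tutorial write to hard-coded `/tmp/...` paths, which assumes a Unix-like system. A more portable variant, sketched here with the standard library only, asks Python for a scratch directory. The shortened schema text and the names `example_schema`, `work_dir`, and `portable_schema_path` are placeholders for this illustration.

```python
import tempfile
from pathlib import Path

# Placeholder schema text; in the tutorial this would be `schema_content`.
example_schema = """prefix : <http://example.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
:Course {
  :name xsd:string
}
"""

# A private working directory avoids clashes with other files in the shared temp dir.
work_dir = Path(tempfile.mkdtemp(prefix="rudof_tutorial_"))
portable_schema_path = work_dir / "tutorial_schema.shex"
portable_schema_path.write_text(example_schema, encoding="utf-8")

print(f"Schema written to {portable_schema_path}")
```

To use such a path with the generator, `str(portable_schema_path)` can be passed wherever the tutorial uses `schema_path`; converting to a string avoids assuming that the bindings accept `Path` objects.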
# Create a basic configuration
config = pyrudof.GeneratorConfig()
# Set the number of entities to generate
config.set_entity_count(10)
# Set output path
config.set_output_path("/tmp/tutorial_output.ttl")
# Set output format to Turtle (human-readable)
config.set_output_format(pyrudof.OutputFormat.Turtle)
print("✓ Configuration created")
print(f" Entity count: {config.get_entity_count()}")
print(f" Output path: {config.get_output_path()}")
✓ Configuration created
Entity count: 10
Output path: /tmp/tutorial_output.ttl
# Create the data generator
generator = pyrudof.DataGenerator(config)
# Generate the data!
generator.run(schema_path)
print("✓ Data generation completed!")
✓ Data generation completed!
# Let's view the generated data
with open("/tmp/tutorial_output.ttl", 'r') as f:
    generated_data = f.read()
print("Generated RDF Data:")
print("=" * 70)
print(generated_data[:1000]) # Show first 1000 characters
print("=" * 70)
Generated RDF Data:
======================================================================
<http://example.org/Person-2> <http://example.org/enrolledIn> <http://example.org/Course-2> , <http://example.org/Course-1-0> ;
a <http://example.org/Person> ;
<http://example.org/birthdate> "1996-04-07"^^<http://www.w3.org/2001/XMLSchema#date> ;
<http://example.org/name> "Epsilon513" .
<http://example.org/Course-2> a <http://example.org/Course> ;
<http://example.org/name> "Gamma523" .
<http://example.org/Course-4> a <http://example.org/Course> ;
<http://example.org/name> "Beta961" .
<http://example.org/Person-1> a <http://example.org/Person> ;
<http://example.org/name> "Alpha216" .
<http://example.org/Course-1> a <http://example.org/Course> ;
<http://example.org/name> "Epsilon365" .
<http://example.org/Person-5> a <http://example.org/Person> ;
<http://example.org/name> "Epsilon226" .
<http://example.org/Person-4> <http://example.org/enrolledIn> <http://example.org/Course-4> , <http://example.org/Course-3-0> ;
a <http://example.org/Person> ;
<http://example.org/birthdate> "20
======================================================================
Part 3: Advanced Configuration Options#
The generator supports many configuration options for fine-tuning data generation.
# Create a fully configured generator
advanced_config = pyrudof.GeneratorConfig()
# Generation settings
advanced_config.set_entity_count(20)
advanced_config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
# Output settings
advanced_config.set_output_path("/tmp/tutorial_advanced.ttl")
advanced_config.set_output_format(pyrudof.OutputFormat.Turtle)
advanced_config.set_write_stats(True) # Generate statistics
advanced_config.set_compress(False)
# Parallel processing (for large datasets)
advanced_config.set_worker_threads(4)
advanced_config.set_batch_size(100)
print("✓ Advanced configuration:")
print(f" Entities: {advanced_config.get_entity_count()}")
print(f" Worker threads: {advanced_config.get_worker_threads()}")
print(f" Statistics enabled: Yes")
✓ Advanced configuration:
Entities: 20
Worker threads: 4
Statistics enabled: Yes
# Generate with advanced settings
advanced_generator = pyrudof.DataGenerator(advanced_config)
advanced_generator.run(schema_path)
print("✓ Data generated with advanced settings")
✓ Data generated with advanced settings
# View the statistics file
stats_path = "/tmp/tutorial_advanced.stats.json"
if os.path.exists(stats_path):
    with open(stats_path, 'r') as f:
        stats = json.load(f)
    print("📊 Generation Statistics:")
    print("=" * 70)
    print(json.dumps(stats, indent=2))
    print("=" * 70)
else:
    print("No statistics file found")
📊 Generation Statistics:
======================================================================
{
"total_triples": 65,
"total_subjects": 25,
"total_predicates": 4,
"total_objects": 42,
"generation_time": "0ms",
"shape_counts": {
"http://example.org/Course": 15,
"http://example.org/Person": 10
}
}
======================================================================
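The statistics file is plain JSON, so it is easy to derive summary figures from it. The sketch below embeds the statistics printed above as a string instead of reading the file, and computes the average number of triples per subject and the share of each shape. The variable names are illustrative.

```python
import json

# Statistics copied from the output above.
stats_json = """
{
  "total_triples": 65,
  "total_subjects": 25,
  "total_predicates": 4,
  "total_objects": 42,
  "generation_time": "0ms",
  "shape_counts": {
    "http://example.org/Course": 15,
    "http://example.org/Person": 10
  }
}
"""

stats = json.loads(stats_json)

triples_per_subject = stats["total_triples"] / stats["total_subjects"]
total_shapes = sum(stats["shape_counts"].values())

print(f"Average triples per subject: {triples_per_subject:.1f}")
for shape, count in sorted(stats["shape_counts"].items()):
    share = count / total_shapes
    print(f"  {shape}: {count} nodes ({share:.0%})")
```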
Part 4: Cardinality Strategies#
Different strategies control how many related entities are created.
# Test different cardinality strategies
strategies = [
("Minimum", pyrudof.CardinalityStrategy.Minimum),
("Maximum", pyrudof.CardinalityStrategy.Maximum),
("Random", pyrudof.CardinalityStrategy.Random),
("Balanced", pyrudof.CardinalityStrategy.Balanced),
]
results = {}
for name, strategy in strategies:
    config = pyrudof.GeneratorConfig()
    config.set_entity_count(5)
    config.set_cardinality_strategy(strategy)
    config.set_output_path(f"/tmp/tutorial_{name.lower()}.ttl")
    config.set_output_format(pyrudof.OutputFormat.Turtle)
    config.set_write_stats(True)
    generator = pyrudof.DataGenerator(config)
    generator.run(schema_path)
    # Read statistics
    stats_path = f"/tmp/tutorial_{name.lower()}.stats.json"
    if os.path.exists(stats_path):
        with open(stats_path, 'r') as f:
            stats = json.load(f)
        results[name] = stats.get('total_triples', 0)
print("Cardinality Strategy Comparison:")
print("=" * 70)
for name, triple_count in results.items():
    print(f" {name:12s}: {triple_count} triples generated")
print("=" * 70)
print("\nNote: Different strategies affect how many relationships are created.")
Cardinality Strategy Comparison:
======================================================================
Minimum : 10 triples generated
Maximum : 25 triples generated
Random : 15 triples generated
Balanced : 15 triples generated
======================================================================
Note: Different strategies affect how many relationships are created.
Custom Cardinality Configuration#
You can also specify custom cardinality ranges for specific properties in your schema. This gives you fine-grained control over how many related entities are generated for each relationship.
# Create a schema with specific cardinality constraints (using ShEx format)
custom_schema_content = """
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ex:User {
ex:name xsd:string ;
ex:email xsd:string ;
ex:hasFriend @ex:User {2,5} ; # Each user must have 2-5 friends
ex:hasPost @ex:Post {0,10} # Each user can have 0-10 posts
}
ex:Post {
ex:title xsd:string ;
ex:content xsd:string ;
ex:hasLike @ex:User {0,50} # Each post can have 0-50 likes
}
"""
# Save the custom schema (using .shex extension for ShEx format)
custom_schema_path = "/tmp/custom_cardinality_schema.shex"
with open(custom_schema_path, 'w') as f:
    f.write(custom_schema_content)
print("✓ Custom cardinality schema created")
print("\nKey cardinality constraints:")
print(" • Each User has 2-5 friends (hasFriend)")
print(" • Each User has 0-10 posts (hasPost)")
print(" • Each Post has 0-50 likes (hasLike)")
✓ Custom cardinality schema created
Key cardinality constraints:
• Each User has 2-5 friends (hasFriend)
• Each User has 0-10 posts (hasPost)
• Each Post has 0-50 likes (hasLike)
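To make the interaction between a `{min,max}` range and a strategy concrete, here is a small conceptual sketch. It is not pyrudof's implementation: the `Strategy` enum and `pick_count` function are invented for this illustration, and treating Balanced as the midpoint of the range is an assumption. The real generator may behave differently.

```python
import random
from enum import Enum


class Strategy(Enum):
    """Illustrative stand-in for pyrudof.CardinalityStrategy (not the real enum)."""
    MINIMUM = "minimum"
    MAXIMUM = "maximum"
    RANDOM = "random"
    BALANCED = "balanced"


def pick_count(min_count: int, max_count: int, strategy: Strategy,
               rng: random.Random) -> int:
    """Choose how many values to emit for a property with range {min,max}."""
    if min_count > max_count:
        raise ValueError("min_count must not exceed max_count")
    if strategy is Strategy.MINIMUM:
        return min_count
    if strategy is Strategy.MAXIMUM:
        return max_count
    if strategy is Strategy.RANDOM:
        return rng.randint(min_count, max_count)  # inclusive on both ends
    # BALANCED: assumed here to mean the midpoint of the range.
    return (min_count + max_count) // 2


rng = random.Random(42)  # fixed seed so the RANDOM branch is repeatable
for strategy in Strategy:
    # Range {2,5} corresponds to the ex:hasFriend constraint in the schema above.
    print(strategy.name, pick_count(2, 5, strategy, rng))
```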
# Generate data with different strategies to see how they respect cardinality constraints
custom_strategies = [
("Minimum", pyrudof.CardinalityStrategy.Minimum),
("Maximum", pyrudof.CardinalityStrategy.Maximum),
("Balanced", pyrudof.CardinalityStrategy.Balanced),
]
print("Generating data with custom cardinality constraints:")
print("=" * 80)
for name, strategy in custom_strategies:
    config = pyrudof.GeneratorConfig()
    config.set_entity_count(5)
    config.set_cardinality_strategy(strategy)
    config.set_output_path(f"/tmp/custom_cardinality_{name.lower()}.ttl")
    config.set_output_format(pyrudof.OutputFormat.Turtle)
    config.set_write_stats(True)
    generator = pyrudof.DataGenerator(config)
    generator.run(custom_schema_path)
    # Read and analyze the generated data
    with open(f"/tmp/custom_cardinality_{name.lower()}.ttl", 'r') as f:
        data = f.read()
    # Note: str.count tallies how often each predicate name appears in the text.
    # Turtle lists several objects after one predicate (separated by commas),
    # so these numbers can be lower than the number of individual links.
    hasFriend_count = data.count('hasFriend')
    hasPost_count = data.count('hasPost')
    hasLike_count = data.count('hasLike')
    print(f"\n{name} Strategy:")
    print(f" hasFriend relationships: {hasFriend_count}")
    print(f" hasPost relationships: {hasPost_count}")
    print(f" hasLike relationships: {hasLike_count}")
print("\n" + "=" * 80)
print("\nObservations:")
print(" • Minimum strategy: Uses minCount values (2 friends, 0 posts, 0 likes)")
print(" • Maximum strategy: Uses maxCount values (5 friends, 10 posts, 50 likes)")
print(" • Balanced strategy: Picks values between min and max")
Generating data with custom cardinality constraints:
================================================================================
Minimum Strategy:
hasFriend relationships: 2
hasPost relationships: 0
hasLike relationships: 0
Maximum Strategy:
hasFriend relationships: 3
hasPost relationships: 3
hasLike relationships: 2
Balanced Strategy:
hasFriend relationships: 3
hasPost relationships: 2
hasLike relationships: 1
================================================================================
Observations:
• Minimum strategy: Uses minCount values (2 friends, 0 posts, 0 likes)
• Maximum strategy: Uses maxCount values (5 friends, 10 posts, 50 likes)
• Balanced strategy: Picks values between min and max
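The counts printed above come from `str.count` on Turtle text. Turtle groups several objects of the same subject and predicate into one comma-separated statement, so a predicate name may appear once even when it links to several objects. One way to count individual relationships is to work on N-Triples, where every line holds exactly one triple. The sketch below uses a small hand-written sample rather than generator output; the data and the helper name `count_predicates` are illustrative.

```python
from collections import Counter

# Hand-written N-Triples sample (not produced by the generator).
sample_nt = """\
<http://example.org/User-1> <http://example.org/hasFriend> <http://example.org/User-2> .
<http://example.org/User-1> <http://example.org/hasFriend> <http://example.org/User-3> .
<http://example.org/User-1> <http://example.org/hasPost> <http://example.org/Post-1> .
<http://example.org/Post-1> <http://example.org/hasLike> <http://example.org/User-2> .
"""


def count_predicates(ntriples_text: str) -> Counter:
    """Count triples per predicate IRI, assuming one triple per non-empty line."""
    counts = Counter()
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate contain no whitespace, so two splits isolate the predicate.
        _subject, predicate, _rest = line.split(None, 2)
        counts[predicate.strip("<>")] += 1
    return counts


for predicate, count in count_predicates(sample_nt).items():
    print(f"{predicate}: {count}")
```

To apply this to generated data, the output format can be set to `pyrudof.OutputFormat.NTriples`, which this tutorial uses in its output-format examples, and the resulting file read into a string.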
# Let's examine a sample of the generated data with Balanced strategy
print("Sample of generated data with Balanced strategy:")
print("=" * 80)
with open("/tmp/custom_cardinality_balanced.ttl", 'r') as f:
    content = f.read()
# Show first 800 characters to see the structure
print(content[:800])
print("...")
print("=" * 80)
print("\nYou can see how the generator respects the cardinality constraints")
print("defined in the schema while following the chosen strategy.")
Sample of generated data with Balanced strategy:
================================================================================
<http://example.org/User-1-0> a <http://example.org/User> ;
<http://example.org/name> "Delta850" , "Delta142" ;
<http://example.org/email> "test38@sample.edu" , "admin64@test.org" .
<http://example.org/User-2-0> a <http://example.org/User> ;
<http://example.org/name> "Beta817" ;
<http://example.org/email> "admin79@test.org" .
<http://example.org/User-3> <http://example.org/hasPost> <http://example.org/Post-1> , <http://example.org/Post-2-1> , <http://example.org/Post-2> , <http://example.org/Post-2-0> ;
<http://example.org/hasFriend> <http://example.org/User-2-0> , <http://example.org/User-3> , <http://example.org/User-2-1> , <http://example.org/User-2-2> , <http://example.org/User-1> , <http://example.org/User-2-3> ;
a <http://example.org/User> ;
<http://example.org/name> "Alpha913
...
================================================================================
You can see how the generator respects the cardinality constraints
defined in the schema while following the chosen strategy.
Part 5: Working with Configuration Files#
For complex configurations, you can save and load settings from TOML files.
# Create a configuration
config = pyrudof.GeneratorConfig()
config.set_entity_count(100)
config.set_output_path("/tmp/from_config.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
config.set_write_stats(True)
config.set_worker_threads(4)
# Save to TOML file
config_file = "/tmp/generator_config.toml"
config.to_toml_file(config_file)
print("✓ Configuration saved to TOML file")
print("\nConfiguration file contents:")
print("=" * 70)
with open(config_file, 'r') as f:
    print(f.read())
print("=" * 70)
✓ Configuration saved to TOML file
Configuration file contents:
======================================================================
[generation]
entity_count = 100
entity_distribution = "Equal"
cardinality_strategy = "Balanced"
[field_generators.default]
locale = "en"
quality = "Medium"
[field_generators.datatypes]
[field_generators.properties]
[output]
path = "/tmp/from_config.ttl"
format = "Turtle"
compress = false
write_stats = true
parallel_writing = false
parallel_file_count = 0
[parallel]
worker_threads = 4
batch_size = 100
parallel_shapes = true
parallel_fields = true
======================================================================
# Load configuration from file
loaded_config = pyrudof.GeneratorConfig.from_toml_file(config_file)
print("✓ Configuration loaded from file")
print(f" Entity count: {loaded_config.get_entity_count()}")
print(f" Output path: {loaded_config.get_output_path()}")
# You can override settings after loading
loaded_config.set_entity_count(50)
loaded_config.set_output_path("/tmp/from_config_modified.ttl")
print("\n✓ Configuration modified")
print(f" New entity count: {loaded_config.get_entity_count()}")
✓ Configuration loaded from file
Entity count: 100
Output path: /tmp/from_config.ttl
✓ Configuration modified
New entity count: 50
Part 7: Different Output Formats#
The generator supports multiple RDF output formats.
# Generate in Turtle format
config_turtle = pyrudof.GeneratorConfig()
config_turtle.set_entity_count(3)
config_turtle.set_output_path("/tmp/tutorial_output.ttl")
config_turtle.set_output_format(pyrudof.OutputFormat.Turtle)
generator_turtle = pyrudof.DataGenerator(config_turtle)
generator_turtle.run(schema_path)
print("Turtle Format:")
print("=" * 70)
with open("/tmp/tutorial_output.ttl", 'r') as f:
    print(f.read()[:500])
print("...")
print("=" * 70)
Turtle Format:
======================================================================
<http://example.org/Person-1> <http://example.org/name> "Delta274" ;
a <http://example.org/Person> .
<http://example.org/Person-2> <http://example.org/enrolledIn> <http://example.org/Course-1-0> , <http://example.org/Course-1> ;
<http://example.org/name> "Epsilon282" ;
<http://example.org/birthdate> "1953-05-09"^^<http://www.w3.org/2001/XMLSchema#date> ;
a <http://example.org/Person> .
<http://example.org/Course-1-0> <http://example.org/name> "Gamma102" ;
a <http://example.org/Course> .
<ht
...
======================================================================
# Generate in NTriples format
config_ntriples = pyrudof.GeneratorConfig()
config_ntriples.set_entity_count(3)
config_ntriples.set_output_path("/tmp/tutorial_output.nt")
config_ntriples.set_output_format(pyrudof.OutputFormat.NTriples)
generator_ntriples = pyrudof.DataGenerator(config_ntriples)
generator_ntriples.run(schema_path)
print("NTriples Format:")
print("=" * 70)
with open("/tmp/tutorial_output.nt", 'r') as f:
    print(f.read()[:500])
print("...")
print("=" * 70)
NTriples Format:
======================================================================
<http://example.org/Course-1> <http://example.org/name> "Delta880" .
<http://example.org/Course-1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Course> .
<http://example.org/Person-2> <http://example.org/birthdate> "2010-08-07"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://example.org/Person-2> <http://example.org/name> "Delta702" .
<http://example.org/Person-2> <http://example.org/enrolledIn> <http://example.org/Course-1> .
<http://example.org/Person-2> <http://exa
...
======================================================================
# Create a simple SHACL schema
shacl_schema = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:PersonShape
a sh:NodeShape ;
sh:targetClass ex:Person ;
sh:property [
sh:path ex:name ;
sh:datatype xsd:string ;
sh:minCount 1 ;
sh:maxCount 1 ;
] ;
sh:property [
sh:path ex:age ;
sh:datatype xsd:integer ;
sh:minCount 1 ;
sh:maxCount 1 ;
] .
"""
# Save SHACL schema
shacl_path = "/tmp/tutorial_schema.shacl.ttl"
with open(shacl_path, 'w') as f:
    f.write(shacl_schema)
print("✓ SHACL schema created")
# Generate data from SHACL schema
shacl_config = pyrudof.GeneratorConfig()
shacl_config.set_entity_count(10)
shacl_config.set_output_path("/tmp/tutorial_shacl_output.ttl")
shacl_config.set_output_format(pyrudof.OutputFormat.Turtle)
shacl_generator = pyrudof.DataGenerator(shacl_config)
# Explicitly load as SHACL
shacl_generator.load_shacl_schema(shacl_path)
shacl_generator.generate()
print("✓ Data generated from SHACL schema")
# View the generated data
with open("/tmp/tutorial_shacl_output.ttl", 'r') as f:
    shacl_output = f.read()
print("\nGenerated data from SHACL:")
print("=" * 70)
print(shacl_output[:500])
print("...")
print("=" * 70)
✓ SHACL schema created
✓ Data generated from SHACL schema
Generated data from SHACL:
======================================================================
<http://example.org/PersonShape-3> a <http://example.org/PersonShape> ;
<http://example.org/name> "Beta650" ;
<http://example.org/age> 6884 .
<http://example.org/PersonShape-5> a <http://example.org/PersonShape> ;
<http://example.org/name> "Gamma525" ;
<http://example.org/age> 8054 .
<http://example.org/PersonShape-8> a <http://example.org/PersonShape> ;
<http://example.org/name> "Alpha384" ;
<http://example.org/age> 4237 .
<http://example.org/PersonShape-7> a <http://example.org/PersonSha
...
======================================================================
Working with SHACL Schemas#
The generator also supports SHACL schemas in addition to ShEx.
# Create a fully configured generator and inspect it
inspect_config = pyrudof.GeneratorConfig()
inspect_config.set_entity_count(50)
inspect_config.set_output_path("/tmp/inspect_output.ttl")
inspect_config.set_output_format(pyrudof.OutputFormat.Turtle)
inspect_config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
inspect_config.set_worker_threads(4)
inspect_config.set_batch_size(100)
inspect_config.set_write_stats(True)
inspect_config.set_compress(False)
print("Configuration Overview:")
print("=" * 80)
print(inspect_config.show())
print("=" * 80)
print("\nYou can use show() to debug configuration issues or document your setup.")
Configuration Overview:
================================================================================
GeneratorConfig { generation: GenerationConfig { entity_count: 50, seed: None, entity_distribution: Equal, cardinality_strategy: Balanced, schema_format: None }, field_generators: FieldGeneratorConfig { default: DefaultFieldConfig { locale: "en", quality: Medium }, datatypes: {}, properties: {} }, output: OutputConfig { path: "/tmp/inspect_output.ttl", format: Turtle, compress: false, write_stats: true, parallel_writing: false, parallel_file_count: 0 }, parallel: ParallelConfig { worker_threads: Some(4), batch_size: 100, parallel_shapes: true, parallel_fields: true } }
================================================================================
You can use show() to debug configuration issues or document your setup.
Inspecting Configuration#
You can view the current configuration settings using the show() method.
# Method 1: Explicit ShEx loading
explicit_config = pyrudof.GeneratorConfig()
explicit_config.set_entity_count(5)
explicit_config.set_output_path("/tmp/tutorial_explicit_shex.ttl")
explicit_config.set_output_format(pyrudof.OutputFormat.Turtle)
generator_explicit = pyrudof.DataGenerator(explicit_config)
# Explicitly load ShEx schema
generator_explicit.load_shex_schema(schema_path)
# Then generate data
generator_explicit.generate()
print("✓ Method 1: Explicit ShEx loading and generation completed")
# Method 2: Using run_with_format for explicit format specification
format_config = pyrudof.GeneratorConfig()
format_config.set_entity_count(5)
format_config.set_output_path("/tmp/tutorial_with_format.ttl")
format_config.set_output_format(pyrudof.OutputFormat.Turtle)
generator_format = pyrudof.DataGenerator(format_config)
generator_format.run_with_format(schema_path, pyrudof.SchemaFormat.ShEx)
print("✓ Method 2: run_with_format completed")
# Method 3: Auto-detect (this is what run() does internally)
auto_config = pyrudof.GeneratorConfig()
auto_config.set_entity_count(5)
auto_config.set_output_path("/tmp/tutorial_auto.ttl")
auto_config.set_output_format(pyrudof.OutputFormat.Turtle)
generator_auto = pyrudof.DataGenerator(auto_config)
generator_auto.load_schema_auto(schema_path)
generator_auto.generate()
print("✓ Method 3: Auto-detect and generation completed")
print("\nAll three methods produce equivalent results!")
✓ Method 1: Explicit ShEx loading and generation completed
✓ Method 2: run_with_format completed
✓ Method 3: Auto-detect and generation completed
All three methods produce equivalent results!
Explicit Schema Loading Methods#
Instead of using the auto-detect run() method, you can explicitly load and generate in separate steps.
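When the explicit loaders are preferred, a small dispatcher can pick one from the file name. The sketch below relies only on the `load_shex_schema` and `load_shacl_schema` methods used in this tutorial. The helper `load_schema_by_extension`, the extension-to-format mapping, and the `RecordingGenerator` stand-in (which lets the example run without pyrudof) are all assumptions made for illustration.

```python
from pathlib import Path


def load_schema_by_extension(generator, schema_path: str) -> str:
    """Call the explicit loader that matches the file extension.

    `generator` is expected to offer `load_shex_schema` and `load_shacl_schema`,
    as the DataGenerator objects in this tutorial do. The mapping from
    extension to format is an assumption of this sketch.
    """
    suffix = Path(schema_path).suffix.lower()
    if suffix == ".shex":
        generator.load_shex_schema(schema_path)
        return "ShEx"
    if suffix in {".ttl", ".shacl"}:
        generator.load_shacl_schema(schema_path)
        return "SHACL"
    raise ValueError(f"Cannot infer schema format from extension {suffix!r}")


class RecordingGenerator:
    """Stand-in that records calls, used only to run the demo without pyrudof."""

    def __init__(self):
        self.calls = []

    def load_shex_schema(self, path):
        self.calls.append(("shex", path))

    def load_shacl_schema(self, path):
        self.calls.append(("shacl", path))


demo = RecordingGenerator()
print(load_schema_by_extension(demo, "/tmp/tutorial_schema.shex"))
print(load_schema_by_extension(demo, "/tmp/tutorial_schema.shacl.ttl"))
```

With a real `pyrudof.DataGenerator` in place of the stand-in, a call to `generate()` would follow, as in the explicit-loading cells.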
# Create a configuration whose values will be written to a JSON file
json_config = pyrudof.GeneratorConfig()
json_config.set_entity_count(25)
json_config.set_output_path("/tmp/from_json.ttl")
json_config.set_output_format(pyrudof.OutputFormat.Turtle)
json_config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Random)
json_config.set_worker_threads(2)
# No JSON export method is used here, so the file is written by hand.
# Note: this flat layout does not mirror the nested TOML sections
# ([generation], [output], [parallel]), so loading it is expected to fail.
try:
    json_config_file = "/tmp/generator_config.json"
    config_data = {
        "entity_count": json_config.get_entity_count(),
        "output_path": json_config.get_output_path(),
        "worker_threads": json_config.get_worker_threads()
    }
    with open(json_config_file, 'w') as f:
        json.dump(config_data, f, indent=2)
    print("✓ JSON configuration file created:")
    print("=" * 70)
    with open(json_config_file, 'r') as f:
        print(f.read())
    print("=" * 70)
    # Try to load it back
    try:
        loaded_json_config = pyrudof.GeneratorConfig.from_json_file(json_config_file)
        print("\n✓ Configuration loaded from JSON")
        print(f" Entity count: {loaded_json_config.get_entity_count()}")
    except Exception as e:
        print(f"\n⚠️ JSON loading failed: {e}")
        print(" (The flat file above lacks the nested sections the loader expects)")
except Exception as e:
    print(f"⚠️ JSON configuration: {e}")
✓ JSON configuration file created:
======================================================================
{
  "entity_count": 25,
  "output_path": "/tmp/from_json.ttl",
  "worker_threads": 2
}
======================================================================
⚠️ JSON loading failed: JSON parsing error: missing field `generation` at line 5 column 1
(The flat file above lacks the nested sections the loader expects)
Working with JSON Configuration Files#
Besides TOML, you can also use JSON configuration files.
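The loader's error message in this tutorial (missing field `generation`) suggests that a JSON configuration needs the same nested sections as the TOML file. The sketch below builds such a structure with the standard library only. The section and key names are copied from the TOML output shown in this tutorial, the values are illustrative, and whether `from_json_file` accepts exactly this layout is an assumption that has not been verified here.

```python
import json

# Nested layout mirroring the sections of the TOML configuration file.
nested_config = {
    "generation": {
        "entity_count": 25,
        "entity_distribution": "Equal",
        "cardinality_strategy": "Random",
    },
    "field_generators": {
        "default": {"locale": "en", "quality": "Medium"},
        "datatypes": {},
        "properties": {},
    },
    "output": {
        "path": "/tmp/from_json.ttl",
        "format": "Turtle",
        "compress": False,
        "write_stats": False,  # illustrative value
        "parallel_writing": False,
        "parallel_file_count": 0,
    },
    "parallel": {
        "worker_threads": 2,
        "batch_size": 100,
        "parallel_shapes": True,
        "parallel_fields": True,
    },
}

json_text = json.dumps(nested_config, indent=2)
print(json_text)
```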
# Configure parallel writing (useful for very large datasets)
parallel_config = pyrudof.GeneratorConfig()
parallel_config.set_entity_count(20)
parallel_config.set_output_path("/tmp/tutorial_parallel.ttl")
parallel_config.set_output_format(pyrudof.OutputFormat.Turtle)
# Enable parallel writing and split across 3 files
parallel_config.set_parallel_writing(True)
parallel_config.set_parallel_file_count(3)
print("✓ Parallel writing configuration:")
print(f" Entity count: {parallel_config.get_entity_count()}")
print(f" Parallel writing: Enabled")
print(f" Output files: 3")
# Generate data
parallel_generator = pyrudof.DataGenerator(parallel_config)
parallel_generator.run(schema_path)
print("\n✓ Data generated with parallel writing")
print("\nGenerated files:")
import glob
parallel_files = glob.glob("/tmp/tutorial_parallel*.ttl")
for f in parallel_files:
    size = os.path.getsize(f)
    print(f" {os.path.basename(f)}: {size} bytes")
✓ Parallel writing configuration:
Entity count: 20
Parallel writing: Enabled
Output files: 3
✓ Data generated with parallel writing
Generated files:
tutorial_parallel_part_001.ttl: 1199 bytes
tutorial_parallel_part_003.ttl: 1104 bytes
tutorial_parallel_part_002.ttl: 1224 bytes
Entity Distribution#
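Control the distribution of entity types in your generated dataset. This is useful when you need specific proportions of different entities for realistic testing scenarios. Note that the cell below passes a plain dict to set_entity_distribution, which expects an EntityDistribution value, so it stops with the TypeError shown in its output.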
# Configure entity distribution
# Control how many entities of each type are generated
distribution_config = pyrudof.GeneratorConfig()
distribution_config.set_entity_count(30)
distribution_config.set_output_path("/tmp/tutorial_distribution.ttl")
distribution_config.set_output_format(pyrudof.OutputFormat.Turtle)
distribution_config.set_write_stats(True)
# Set specific distribution: 70% Persons, 30% Courses
# This gives more control over the shape of your generated data
distribution_config.set_entity_distribution({
"http://example.org/Person": 0.7,
"http://example.org/Course": 0.3
})
print("✓ Entity distribution configuration:")
print(f" Total entities: {distribution_config.get_entity_count()}")
print(f" Distribution: 70% Person, 30% Course")
print(f" Expected: ~21 Persons, ~9 Courses")
# Generate data with custom distribution
distribution_generator = pyrudof.DataGenerator(distribution_config)
distribution_generator.run(schema_path)
print("\n✓ Data generated with custom entity distribution")
# Verify the distribution in the generated data
with open("/tmp/tutorial_distribution.ttl", 'r') as f:
    distribution_output = f.read()
person_count = distribution_output.count('a <http://example.org/Person>')
course_count = distribution_output.count('a <http://example.org/Course>')
print(f"\nActual distribution:")
print(f" Persons: {person_count}")
print(f" Courses: {course_count}")
print(f" Ratio: {person_count/(person_count+course_count):.1%} / {course_count/(person_count+course_count):.1%}")
# View statistics
stats_path = "/tmp/tutorial_distribution.stats.json"
if os.path.exists(stats_path):
    with open(stats_path, 'r') as f:
        stats = json.load(f)
    print(f"\n📊 Total triples generated: {stats.get('total_triples', 0)}")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[22], line 11
7 distribution_config.set_write_stats(True)
9 # Set specific distribution: 70% Persons, 30% Courses
10 # This gives more control over the shape of your generated data
---> 11 distribution_config.set_entity_distribution({
12 "http://example.org/Person": 0.7,
13 "http://example.org/Course": 0.3
14 })
16 print("✓ Entity distribution configuration:")
17 print(f" Total entities: {distribution_config.get_entity_count()}")
TypeError: argument 'distribution': 'dict' object cannot be converted to 'EntityDistribution'
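The expected counts quoted in the cell (about 21 Persons and 9 Courses out of 30) follow from multiplying the total by each ratio. The sketch below works that arithmetic out in plain Python, with a largest-remainder rounding step so the parts always sum to the total. It does not call pyrudof, and the helper name `expected_counts` is illustrative.

```python
import math


def expected_counts(total: int, ratios: dict[str, float]) -> dict[str, int]:
    """Split `total` according to `ratios`, using largest-remainder rounding."""
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("ratios must sum to 1")
    exact = {name: total * ratio for name, ratio in ratios.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining units to the entries with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


ratios = {
    "http://example.org/Person": 0.7,
    "http://example.org/Course": 0.3,
}
for shape, count in expected_counts(30, ratios).items():
    print(f"{shape}: {count}")
```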
Parallel Writing for Large Datasets#
When generating very large datasets, you can split the output across multiple files for better performance.
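If a single file is needed afterwards, the parts can be concatenated. The sketch below assumes the parts are Turtle files written with full IRIs and without clashing blank node labels, in which case plain concatenation yields valid Turtle. It sorts the file names because `glob` does not guarantee any order. The helper `merge_parts` and the stand-in part files are illustrative and do not come from pyrudof.

```python
import tempfile
from pathlib import Path


def merge_parts(directory: Path, pattern: str, merged_name: str) -> Path:
    """Concatenate all files matching `pattern` into one file, in name order."""
    merged_path = directory / merged_name
    part_paths = sorted(p for p in directory.glob(pattern) if p != merged_path)
    with merged_path.open("w", encoding="utf-8") as out:
        for part in part_paths:
            text = part.read_text(encoding="utf-8")
            out.write(text)
            if not text.endswith("\n"):
                out.write("\n")  # keep statements from running together
    return merged_path


# Demonstration with stand-in part files in a scratch directory.
scratch = Path(tempfile.mkdtemp(prefix="rudof_parts_"))
for index in (1, 2, 3):
    part = scratch / f"tutorial_parallel_part_{index:03d}.ttl"
    part.write_text(
        f'<http://example.org/Person-{index}> <http://example.org/name> "N{index}" .\n',
        encoding="utf-8",
    )

merged = merge_parts(scratch, "tutorial_parallel_part_*.ttl", "tutorial_parallel_merged.ttl")
print(merged.read_text(encoding="utf-8"))
```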
Part 6: Advanced Features#
Let’s explore some additional advanced features that haven’t been covered yet.
Summary#
In this tutorial, you learned:
Core Features#
Basic Setup - Import and initialize the pyrudof library
Simple Data Generation - Generate RDF data from ShEx schemas
Advanced Configuration - Fine-tune generation with multiple options
Cardinality Strategies - Control relationship density (Minimum, Maximum, Random, Balanced)
Configuration Files - Save and load settings from TOML files
Advanced Features#
Output Formats - Generate data in Turtle, NTriples, and other RDF formats
SHACL Schema Support - Generate data from SHACL schemas
Configuration Inspection - View and debug configuration settings with show()
Explicit Schema Loading - Load schemas explicitly (ShEx/SHACL) or auto-detect
JSON Configuration - Work with JSON configuration files
Parallel Writing - Split large datasets across multiple files
Entity Distribution - Control proportions of different entity types
Next Steps#
Try with your own ShEx or SHACL schemas
Generate larger datasets for testing
Experiment with parallel processing and parallel writing settings
Integrate into your data pipelines
Validate generated data against schemas
Test different cardinality strategies for your use case
Resources#
# Clean up temporary files (optional)
import glob
temp_files = glob.glob("/tmp/tutorial_*")
print(f"Generated {len(temp_files)} temporary files during this tutorial")
print("\nTo clean up, you can remove files in /tmp/tutorial_*")
Generated 18 temporary files during this tutorial
To clean up, you can remove files in /tmp/tutorial_*
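The cell above only counts files named `tutorial_*`, while other cells also write files such as `custom_cardinality_*`, `from_config*`, and `generator_config.*`. A sketch of a cleanup helper that removes files by pattern follows. The helper `remove_matching` and the pattern list are assembled for this illustration from the paths used in this tutorial, and the demo runs in a scratch directory so nothing in the real `/tmp` is touched.

```python
import tempfile
from pathlib import Path

# File-name patterns for outputs written by the cells in this tutorial.
TUTORIAL_PATTERNS = [
    "tutorial_*",
    "custom_cardinality_*",
    "from_config*",
    "generator_config.*",
]


def remove_matching(directory: Path, patterns: list[str]) -> list[str]:
    """Delete regular files in `directory` matching any pattern; return their names."""
    removed = []
    for pattern in patterns:
        # Materialize the matches first so deletion does not disturb iteration.
        for path in list(directory.glob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path.name)
    return sorted(removed)


# Demonstration in a scratch directory rather than the real /tmp.
scratch = Path(tempfile.mkdtemp(prefix="rudof_cleanup_"))
for name in ("tutorial_output.ttl", "from_config.ttl", "keep_me.txt"):
    (scratch / name).write_text("", encoding="utf-8")

removed_names = remove_matching(scratch, TUTORIAL_PATTERNS)
print("Removed:", removed_names)
print("Remaining:", sorted(p.name for p in scratch.iterdir()))
```

Pointing the helper at `Path("/tmp")` would delete the tutorial's files for real, so it is worth checking the pattern list first.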