
Parallel Graph Loading

anilpacaci edited this page Jun 29, 2017 · 1 revision

Pre-requisites

  1. Graph data generated in the adjacency list file format and uploaded to HDFS
  2. Partition mapping generated and available as a text file
  3. A running Cassandra cluster
    • I prefer Cassandra instances to use local disk for storage, so I make sure the data directories are set to a specific directory on the local file system
    • Name the Cassandra data directories after the dataset, the partitioning scheme and the number of partitions, e.g. sf3_metis_4
    • Cassandra instances should be configured to use ByteOrderedPartitioner, and initial_token should be set on each instance so that the token range is divided equally among the partitions
  4. JanusGraph installation
    • Edgecut branch of anilpacaci's fork
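
The equal division of the token range mentioned above can be sketched as follows. This is only an illustration: it assumes a fixed-width key space of 8 bytes (the `key_bytes` parameter is hypothetical, not a value taken from this setup) and prints one hex token per partition. How the resulting tokens line up with the partition IDs used by the fork should be checked against your own cluster.

```python
def initial_tokens(num_partitions, key_bytes=8):
    """Return evenly spaced hex tokens, one per partition.

    Assumes keys live in a fixed-width space of `key_bytes` bytes.
    That width is an assumption made for this sketch.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    space = 1 << (8 * key_bytes)      # total size of the assumed key space
    step = space // num_partitions    # width of each partition's range
    width = 2 * key_bytes             # number of hex digits per token
    return [format(i * step, "0{}x".format(width))
            for i in range(num_partitions)]


if __name__ == "__main__":
    # One line per Cassandra instance, e.g. for a 4-partition setup.
    for node, token in enumerate(initial_tokens(4)):
        print("node {}: initial_token: {}".format(node, token))
```

Each printed value would go into the initial_token setting of one Cassandra instance.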

ID-Partition Map Bulk Loading

  1. Although a Java HashMap might be sufficient for smaller datasets, it's better to stick with Memcached
  2. Run Memcached with enough memory to hold the whole map
    • ~512M per scale factor (sf), e.g. sf3 --> ~1500M
  3. The PartitionLookupImporter script can be used to populate Memcached
    • Run the Gremlin console from the JanusGraph installation
    • Modify the snb.properties file to point to your running Memcached instance and to the partition lookup file
    • Load the PartitionLookupImporter.groovy script using the :load command
    • Run PartitionLookupImporter.load(path to snb.properties file)
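
As a rough aid for the memory sizing above, the sketch below applies the ~512M-per-scale-factor rule of thumb and builds a Memcached command line from it. The function names are made up for this example, and the port is simply Memcached's default; `-m` (memory limit in megabytes) and `-p` (TCP port) are standard memcached options.

```python
import math

MB_PER_SCALE_FACTOR = 512  # rule of thumb from the list above


def memcached_memory_mb(scale_factor):
    """Estimate the Memcached memory limit (in MB) for a given scale factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return int(math.ceil(MB_PER_SCALE_FACTOR * scale_factor))


def memcached_command(scale_factor, port=11211):
    """Build the argument list for starting Memcached with that limit."""
    return ["memcached",
            "-m", str(memcached_memory_mb(scale_factor)),
            "-p", str(port)]


if __name__ == "__main__":
    # For sf3 this yields a 1536 MB limit, in line with the ~1500M above.
    print(" ".join(memcached_command(3)))
```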

Parallelized Bulk Loading

  1. Configure JanusGraph instance
    • Configure a JanusGraph instance to run on the Cassandra cluster; janusgraph-cassandra-es-server.properties is the sample configuration file I use
    • Make sure that storage.hostname points to the IP of the Cassandra instance running on the same machine
    • Make sure that ids.placement-history-hostname points to the running Memcached server
    • Set ids.placement to PartitionAwarePlacementStrategy for the fennel, ldg and metis partitioning schemes
    • Load the initialization script (https://github.com/anilpacaci/graph-partitioning/blob/master/scripts/initJanus.groovy) using the :load command from the Gremlin console
    • Call initializeJanus([path to janusgraph-cassandra-es-server.properties])
    • Open a graph object over the newly created graph instance:

graph = JanusGraphFactory.open([path to janusgraph-cassandra-es-server.properties])

  2. Start the multithreaded data loader
    • Load the SNB script (https://github.com/anilpacaci/graph-partitioning/blob/master/scripts/SNBParser.groovy) using the :load command from the Gremlin console
    • Configure the snb.properties file: point it to the correct Memcached instance, and set input.base to the social_network directory of the dataset being loaded
    • Call SNBParser.loadSNBGraph([graph instance from step 1], [path to snb.properties])
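
Before starting a long load it can help to confirm that the JanusGraph configuration file actually sets the keys named in step 1. The following is a minimal sketch of such a check, not part of the repository's scripts: the helper names are hypothetical, the sample values are placeholders, and the parser only handles simple `key=value` lines rather than the full Java properties syntax.

```python
# Keys that step 1 asks you to verify.
REQUIRED_KEYS = (
    "storage.hostname",
    "ids.placement",
    "ids.placement-history-hostname",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not props.get(key)]


if __name__ == "__main__":
    # Placeholder content; in practice, read the real properties file.
    sample = (
        "storage.hostname=127.0.0.1\n"
        "# Memcached host left out on purpose\n"
        "ids.placement=PartitionAwarePlacementStrategy\n"
    )
    print(missing_keys(parse_properties(sample)))
```

An empty result means all three keys are set; it does not verify that the hosts they name are reachable.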