This project sets up a SolrCloud environment for indexing and searching parliamentary speeches. It can be adapted for other datasets with similar structures. This guide provides detailed, step-by-step instructions for setup, use, and customization.
- Prerequisites
- File Structure
- Detailed Setup Instructions
- Adapting for Different Datasets
- Troubleshooting
- Maintenance and Management
Ensure you have the following installed on your system:
- Docker (version 19.03 or later)
- Docker Compose (version 1.25 or later)
- Python 3.7 or later
- pip (Python package installer)
Install the required Python libraries:
pip install requests
Ensure your project directory is structured as follows:
project_root/
│
├── 01_deploy_solr.py
├── 02_upload_config.py
├── 03_create_collection.py
├── 04_upload_documents.py
├── 05_stop_and_clean.py
├── docker-compose.yml
├── configsets/
│   └── myconfig/
│       ├── schema.xml
│       └── solrconfig.xml
└── data/
    └── Greek_Parliament_Proceedings_1989_2020_DataSample_with_id_and_formatted_date.csv
Follow these steps carefully to set up and run your SolrCloud environment:
- Open a terminal and navigate to your project root directory.
- Run the deployment script:
python 01_deploy_solr.py
- The script will:
  - Check if Docker is running
  - Verify Docker Compose is installed
  - Start the SolrCloud containers defined in docker-compose.yml
  - Wait for SolrCloud to be ready (this may take a few minutes)
- If successful, you'll see the message: "SolrCloud deployment completed successfully."
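The "wait for SolrCloud to be ready" step can be pictured as a simple polling loop. The sketch below is an illustration under assumptions, not the contents of 01_deploy_solr.py: the function names `solr_is_up` and `wait_until_ready` are hypothetical, the check assumes the Collections API `CLUSTERSTATUS` action on the default port 8983, and it uses the standard library instead of requests to stay dependency-free.

```python
import time
import urllib.request


def solr_is_up(url="http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS"):
    """Return True if Solr answers the request with HTTP 200 (assumed endpoint)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP errors all derive from OSError.
        return False


def wait_until_ready(check=solr_is_up, timeout=300, interval=5, sleep=time.sleep):
    """Call check() every `interval` seconds until it succeeds or `timeout` elapses."""
    waited = 0
    while waited <= timeout:
        if check():
            return True
        sleep(interval)
        waited += interval
    return False
```

Passing `check` and `sleep` as parameters keeps the loop easy to try out without a running cluster.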
- In the same terminal, run:
python 02_upload_config.py myconfig
- This script uploads the configuration files from configsets/myconfig/ to ZooKeeper.
- If successful, you'll see: "Configuration 'myconfig' uploaded successfully."
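One possible way to get a configset into ZooKeeper is to zip the directory and POST it to Solr's Configsets API (`/solr/admin/configs?action=UPLOAD`). Whether 02_upload_config.py works this way is not stated here, so treat the following as a sketch under that assumption; `zip_configset` and `upload_configset` are hypothetical names.

```python
import io
import os
import urllib.parse
import urllib.request
import zipfile


def zip_configset(config_dir):
    """Zip the files under config_dir in memory, with paths relative to it."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(config_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, config_dir))
    return buffer.getvalue()


def upload_configset(name, config_dir, solr="http://localhost:8983/solr"):
    """POST the zipped configset to the Configsets API (assumed endpoint)."""
    query = urllib.parse.urlencode({"action": "UPLOAD", "name": name})
    request = urllib.request.Request(
        f"{solr}/admin/configs?{query}",
        data=zip_configset(config_dir),
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.status
```

With a running cluster, `upload_configset("myconfig", "configsets/myconfig")` would be the equivalent call for the directory layout shown earlier.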
- Create a new collection by running:
python 03_create_collection.py parliamentary_speeches
- This creates a new collection named "parliamentary_speeches" using the uploaded configuration.
- If successful, you'll see: "Collection 'parliamentary_speeches' created successfully."
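Creating a collection in SolrCloud usually comes down to one request to the Collections API. The sketch below only builds the request URL so the parameters are visible; `create_collection_url` is a hypothetical helper, and the shard and replica counts are illustrative defaults rather than values taken from 03_create_collection.py.

```python
import urllib.parse


def create_collection_url(name, config_name="myconfig", num_shards=1,
                          replication_factor=1, solr="http://localhost:8983/solr"):
    """Build a Collections API CREATE URL for the given collection and configset."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    return f"{solr}/admin/collections?{urllib.parse.urlencode(params)}"


print(create_collection_url("parliamentary_speeches"))
```

Sending a GET request to the printed URL (for example with `requests.get`) would ask Solr to create the collection.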
- Before running the upload script, ensure your CSV file is in the correct location and format.
- Run the upload script:
python 04_upload_documents.py
- This script will read the CSV file and upload the documents to Solr in batches.
- You'll see progress messages like: "Successfully sent X documents to Solr"
- At the end, it will commit the changes to Solr.
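The batching described above can be sketched as follows. This is an illustration, not the code of 04_upload_documents.py: `batches` and `update_payloads` are hypothetical names, the batch size is arbitrary, and the `speech` column in the demo is made up. Each payload would be POSTed to the collection's `/update` handler with `Content-Type: application/json`, followed by a final request with `commit=true`.

```python
import csv
import io
import json


def batches(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


def update_payloads(csv_file, batch_size=1000):
    """Yield one JSON array (as UTF-8 bytes) per batch of CSV rows."""
    for batch in batches(csv.DictReader(csv_file), batch_size):
        yield json.dumps(batch).encode("utf-8")


# Demo with an in-memory CSV (hypothetical columns).
sample = io.StringIO("id,speech\n1,first\n2,second\n3,third\n")
for payload in update_payloads(sample, batch_size=2):
    print(len(json.loads(payload)), "documents in this batch")
```

Sending documents in batches keeps each request small and means a failure affects only one batch rather than the whole file.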
- Open a web browser and go to http://localhost:8983/solr/
- You should see the Solr Admin interface.
- Click on "Collection Selector" and choose "parliamentary_speeches" (in SolrCloud mode, "Core Selector" lists the collection's individual replica cores instead)
- Go to the "Query" section and click "Execute Query" to see if your documents are indexed.
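The same check can be scripted by asking the collection's `/select` handler for a match-all query with `rows=0` and reading `numFound` from the JSON response. The helper names below are hypothetical, and the sketch uses the standard library instead of requests.

```python
import json
import urllib.parse
import urllib.request


def count_url(collection, solr="http://localhost:8983/solr"):
    """Build a match-all query URL that returns only the hit count."""
    query = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{solr}/{collection}/select?{query}"


def num_found(response_body):
    """Extract the total hit count from a JSON select response."""
    return json.loads(response_body)["response"]["numFound"]


def count_documents(collection):
    """Query a running Solr instance and return the number of indexed documents."""
    with urllib.request.urlopen(count_url(collection), timeout=10) as resp:
        return num_found(resp.read())
```

With the cluster running, `count_documents("parliamentary_speeches")` should match the number of rows uploaded from the CSV file.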
To use this setup with a different dataset, you'll need to modify several files:
- Open configsets/myconfig/schema.xml
- Update the <fields> section to match your data structure:
  - Add, remove, or modify <field> elements
  - Ensure the name attribute matches your CSV column names
  - Choose appropriate type attributes (e.g., "string", "text_general", "date")
- Adjust <copyField> directives if needed
- Change the <uniqueKey> if your unique identifier field is different
Example field definition:
<field name="your_field_name" type="text_general" indexed="true" stored="true"/>
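For a CSV file with many columns, a first draft of the field definitions can be generated from the header row and then edited by hand. The sketch below assumes every column should be indexed and stored and defaults to `text_general`; `field_elements` is a hypothetical helper and the demo columns are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET


def field_elements(csv_file, types=None, default_type="text_general"):
    """Return one <field .../> string per CSV column in the header row."""
    types = types or {}
    header = next(csv.reader(csv_file))
    lines = []
    for column in header:
        field = ET.Element("field", {
            "name": column,
            "type": types.get(column, default_type),
            "indexed": "true",
            "stored": "true",
        })
        lines.append(ET.tostring(field, encoding="unicode"))
    return lines


# Demo with an in-memory CSV (hypothetical columns).
for line in field_elements(io.StringIO("id,speech\n1,hello\n"), types={"id": "string"}):
    print(line)
```

The generated lines are only a starting point: the type names must exist as field types in your schema.xml.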
- Open 04_upload_documents.py
- Change the input_file path to your CSV file:
input_file = 'data/your_new_data.csv'
- Modify the solr_url if you used a different collection name:
solr_url = 'http://localhost:8983/solr/your_collection_name/update'
- Adjust the CSV parsing logic if your file structure is different:
reader = csv.DictReader(file)
for row in reader:
    # Modify this part to match your CSV structure
    doc = {
        'id': row['your_id_field'],
        'field1': row['your_field1'],
        'field2': row['your_field2'],
        # ... add all your fields here
    }
    docs.append(doc)
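When column names differ from the Solr field names, the mapping can be kept in one dictionary instead of being spelled out field by field. The sketch below is one way to do that; the column names, the `%d/%m/%Y` input format, and the helper names are assumptions for illustration. Solr date fields expect values in the form `YYYY-MM-DDThh:mm:ssZ`.

```python
from datetime import datetime


def to_solr_date(value, input_format="%d/%m/%Y"):
    """Convert a date string to the ISO 8601 UTC form used by Solr date fields."""
    return datetime.strptime(value, input_format).strftime("%Y-%m-%dT%H:%M:%SZ")


def row_to_doc(row, field_map, date_fields=()):
    """Rename CSV columns to Solr field names, skipping empty values."""
    doc = {}
    for column, solr_field in field_map.items():
        value = row.get(column)
        if value is None or value == "":
            continue  # leave missing values out of the document
        doc[solr_field] = to_solr_date(value) if solr_field in date_fields else value
    return doc
```

For example, `row_to_doc(row, {"speech_id": "id", "sitting_date": "date"}, date_fields=("date",))` would rename both columns and convert the date in a single pass.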
- Open docker-compose.yml
- If you changed the config name, update the volume mapping:
volumes:
  - ./configsets/your_config_name:/opt/solr/server/solr/configsets/your_config_name
- Modify port mappings if needed (e.g., if port 8983 is already in use)
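Before changing a port mapping, it can help to confirm whether the port is actually taken. The following sketch tries to connect to the port on the local machine; `port_in_use` is a hypothetical helper and is not part of the project scripts.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("8983 in use:", port_in_use(8983))
```

If the port is busy, change only the host side of the mapping in docker-compose.yml (the number before the colon) and use the new port in any URLs that point at Solr.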
- Open 03_create_collection.py
- Change the default config_name if you used a different name:
config_name = "your_config_name"
After making these changes, follow the setup instructions again from Step 1.
If you encounter issues:
- SolrCloud fails to start:
  - Check Docker logs: docker-compose logs
  - Ensure all required ports are available
  - Verify Docker and Docker Compose versions
- Configuration upload fails:
  - Check ZooKeeper connectivity
  - Ensure config files are in the correct location
- Collection creation fails:
  - Verify the config was uploaded successfully
  - Check Solr logs for any error messages
- Document upload fails:
  - Verify your CSV structure matches the schema
  - Check for any data formatting issues
  - Ensure Solr is running and the collection exists
- Query returns no results:
  - Verify documents were uploaded successfully
  - Check your query syntax
  - Ensure you're querying the correct fields
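A frequent cause of failed uploads is a CSV column with no matching field in the schema. The sketch below compares the CSV header with the `<field>` names in a schema file; the helper names are hypothetical, the demo schema is a made-up fragment, and `<dynamicField>` patterns are deliberately not considered.

```python
import csv
import io
import xml.etree.ElementTree as ET


def schema_field_names(schema_xml):
    """Return the set of name attributes of all <field> elements."""
    root = ET.fromstring(schema_xml)
    return {field.get("name") for field in root.iter("field")}


def columns_missing_from_schema(csv_file, schema_xml):
    """Return CSV header columns that have no matching <field> in the schema."""
    header = next(csv.reader(csv_file))
    known = schema_field_names(schema_xml)
    return [column for column in header if column not in known]


# Demo with in-memory data (hypothetical schema fragment and columns).
demo_schema = (
    '<schema name="demo" version="1.6"><fields>'
    '<field name="id" type="string"/>'
    '<field name="speech" type="text_general"/>'
    '</fields></schema>'
)
print(columns_missing_from_schema(io.StringIO("id,speech,party\n"), demo_schema))
```

For the real files, read configsets/myconfig/schema.xml into a string and open the CSV with `newline=""` and the correct encoding before passing them in.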
-
Stopping the System: To stop and remove the SolrCloud containers and clean up data:
python 05_stop_and_clean.py
-
Backing Up Data:
- Use Solr's backup API or
- Create a snapshot of the data directory
-
Monitoring:
- Use Solr's admin interface for basic monitoring
- Consider setting up Prometheus and Grafana for advanced monitoring
-
Scaling:
- Add more Solr nodes in the
docker-compose.yml
file - Increase the number of shards when creating the collection
- Add more Solr nodes in the
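For the backup option, SolrCloud exposes a `BACKUP` action in the Collections API. The sketch below only builds the request URL; `backup_url` is a hypothetical helper, and the backup name and location are placeholders. The location must be a path that every Solr node can write to, such as a shared volume.

```python
import urllib.parse


def backup_url(collection, backup_name, location, solr="http://localhost:8983/solr"):
    """Build a Collections API BACKUP URL for the given collection."""
    params = {
        "action": "BACKUP",
        "name": backup_name,
        "collection": collection,
        "location": location,
    }
    return f"{solr}/admin/collections?{urllib.parse.urlencode(params)}"


print(backup_url("parliamentary_speeches", "speeches_backup", "/var/solr/backups"))
```

Because 05_stop_and_clean.py also cleans up data, take any backup you need before running it.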
Remember to adjust security settings and add authentication for production deployments. This setup is intended for development and testing purposes.