A Golang API that indexes web pages in Elasticsearch. It accepts POST requests and runs crawls in the background.
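A rough sketch of that flow (hypothetical handler and names, not the project's actual code): decode the POST body, start the crawl in a goroutine, and return a 201 immediately.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// crawlRequest mirrors the POST bodies shown later in this README.
type crawlRequest struct {
	Index  string `json:"index,omitempty"`  // Elasticsearch crawls
	Engine string `json:"engine,omitempty"` // App Search crawls
	URL    string `json:"url"`
	Type   string `json:"type"` // "elasticsearch" or "app-search"
}

func crawlHandler(w http.ResponseWriter, r *http.Request) {
	var req crawlRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Run the crawl in the background so the request returns immediately.
	go func() {
		// A real crawler would fetch pages starting at req.URL and index
		// them into Elasticsearch or App Search depending on req.Type.
		log.Printf("starting %s crawl of %s", req.Type, req.URL)
	}()

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusCreated)
	json.NewEncoder(w).Encode(map[string]interface{}{"status": 201, "url": req.URL})
}

func main() {
	http.HandleFunc("/crawl", crawlHandler)
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```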
Requirements:

- Go 1.13.5+
- Elasticsearch v7.5.1+
Requires a config YAML file in `conf`. For instance, `/conf/local.yml`:
```yaml
elasticsearch:
  endpoint: http://localhost:9200
  password: changeme
  username: elastic
appsearch:
  endpoint: http://localhost:3002
  api: /api/as/v1/
  token: private-xxxxxxxxxxxxxxxxx
server:
  port: 8081
  readHeaderTimeoutMillis: 3000
```
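The `Config File "prod" Not Found in "[/conf /opt/bin/conf /opt/bin]"` error quoted later in this README looks like spf13/viper output, so loading this file can be sketched roughly as follows (the struct, function name, and search paths are assumptions, not the project's actual code):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/spf13/viper"
)

// Config mirrors the YAML layout shown above.
type Config struct {
	Elasticsearch struct {
		Endpoint string `mapstructure:"endpoint"`
		Username string `mapstructure:"username"`
		Password string `mapstructure:"password"`
	} `mapstructure:"elasticsearch"`
	Appsearch struct {
		Endpoint string `mapstructure:"endpoint"`
		API      string `mapstructure:"api"`
		Token    string `mapstructure:"token"`
	} `mapstructure:"appsearch"`
	Server struct {
		Port                    int `mapstructure:"port"`
		ReadHeaderTimeoutMillis int `mapstructure:"readHeaderTimeoutMillis"`
	} `mapstructure:"server"`
}

// loadConfig reads conf/<envID>.yml, e.g. conf/local.yml when ENV_ID=local.
func loadConfig(envID string) (*Config, error) {
	v := viper.New()
	v.SetConfigName(envID) // file name without the .yml extension
	v.SetConfigType("yaml")
	v.AddConfigPath("./conf")
	v.AddConfigPath("/conf")
	if err := v.ReadInConfig(); err != nil {
		return nil, err
	}
	var cfg Config
	if err := v.Unmarshal(&cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv("ENV_ID"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("server will listen on :%d with a read-header timeout of %v\n",
		cfg.Server.Port,
		time.Duration(cfg.Server.ReadHeaderTimeoutMillis)*time.Millisecond)
}
```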
Steps:
- Launch App Search and Elasticsearch
- Create a local config file:

  ```yaml
  elasticsearch:
    endpoint: http://localhost:9200
  appsearch:
    endpoint: http://localhost:3002
    api: /api/as/v1/
    token: private-xxxxxxxxxxxxxxx
  server:
    port: 8081
    readHeaderTimeoutMillis: 3000
  ```
- Install vendor dependencies:

  ```bash
  go mod vendor
  ```

- Export the env ID:

  ```bash
  export ENV_ID=local
  ```

- Create an env config in `/conf` (example above). The name of this config file should match the value of the exported env ID.
- Compile (required to run the binary locally):

  ```bash
  CGO_ENABLED=0 go build -mod vendor -o ./bin/elastic-webcrawler ./cmd/elastic-webcrawler/main.go
  ```

- Run the compiled binary:

  ```bash
  ./bin/elastic-webcrawler
  ```
- If using App Search, create the engine in App Search (API doesn't create it for you).
- Launch a crawl:

  ```bash
  curl -XPOST localhost:8081/crawl -d '{
    "engine": "swiftype-website",
    "url": "https://swiftype.com/",
    "type": "app-search"
  }'
  ```
This project builds and publishes a container with two tags, `latest` and `commit_hash`, to Docker Hub on merge to master. If you're running the container locally alongside Elasticsearch and/or App Search, make sure to run all of them on the same Docker network. More about Docker networks can be found in the Docker networking documentation.
Docker Hub: https://hub.docker.com/repository/docker/wambozi/elastic-webcrawler
Steps:
- Launch App Search and Elasticsearch
- Inspect the docker network(s) to get the subnet for your bridge:

  ```bash
  $ docker network inspect bridge
  [
      {
          "Name": "bridge",
          "Id": "b38c312777a0f3890034c9b396669842947b80c9051d10a283c9d43937910578",
          "Scope": "local",
          "Driver": "bridge",
          "IPAM": {
              "Driver": "default",
              "Options": null,
              "Config": [
                  {
                      "Subnet": "172.17.0.2/16"  << CIDR for our bridge
                  }
              ]
          },
          ...
      }
  ]
  ```
- Create a local config file in `conf` using the subnet for the network you plan to run the Elasticsearch and App Search containers on. In this case, 172.17.0.2, e.g.:
  ```yaml
  elasticsearch:
    endpoint: http://172.17.0.2:9200
  appsearch:
    endpoint: http://172.17.0.2:3002
    api: /api/as/v1/
    token: private-xxxxxxxxxxxxxxx
  server:
    port: 8081
    readHeaderTimeoutMillis: 3000
  ```
- Run the containers. The `docker run` commands don't need to specify the docker network, as long as we put the subnet for the bridge network in our `local.yml`:

  ```bash
  docker run -d --name elastic --network=bridge -it -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "action.auto_create_index=.app-search-*-logs-*,-.app-search-*,+*" docker.elastic.co/elasticsearch/elasticsearch:7.5.1

  docker run -d --name app-search --network=bridge -it -p 3002:3002 -e "allow_es_settings_modification=true" docker.elastic.co/app-search/app-search:7.5.1

  docker run --rm --name webcrawler -it -e "ENV_ID=local" -v "$(pwd)/conf:/conf" -p 8081:8081 wambozi/elastic-webcrawler:latest
  ```
  - `--network=bridge`: The default network driver. If you don't specify a driver, this is the type of network you are creating. Bridge networks are usually used when your applications run in standalone containers that need to communicate. I specify it here for transparency.
  - `-t`: Allocate a pseudo-TTY.
  - `-i`: Keep STDIN open even if not attached.
  - `-v`: Mount `$(pwd)/conf` into the container's `/conf` dir (so it makes `local.yml` accessible there). See "Using bind mounts" in the Docker docs.
  - `-e`: Required to specify the name of the env file. If `ENV_ID=local` isn't passed into the container, the container will exit with: `ERRO[0000] stdErr: &{file:0xc0000980c0} , error: Error reading config file. env: nick error: Config File "prod" Not Found in "[/conf /opt/bin/conf /opt/bin]"`. The value of `ENV_ID` should be the name of the config file being used.
  - `-p`: Expose the webserver port. This port should correspond to the value of `server.port` in your config.
- If using App Search, create the engine in App Search (API doesn't create it for you).
- Launch a crawl:
  - Elasticsearch Crawl:

    ```bash
    curl -XPOST localhost:8081/crawl -d '{
      "index": "example-website",
      "url": "https://example.com/",
      "type": "elasticsearch"
    }'
    ```

  - App Search Crawl:

    ```bash
    curl -XPOST localhost:8081/crawl -d '{
      "engine": "example-website",
      "url": "https://example.com/",
      "type": "app-search"
    }'
    ```
- To run the docker container locally with Elasticsearch and App Search using make:

  ```bash
  make run-local
  ```

  This runs the `docker run` commands above and checks that Elasticsearch is healthy (a rough sketch of such a check follows below).
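The health check can be approximated by polling Elasticsearch's `/_cluster/health` endpoint until it answers. A minimal Go sketch, assuming "healthy" simply means a 200 response (the Makefile target itself presumably does something equivalent in shell):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForElasticsearch polls the cluster health endpoint until it returns
// 200 OK or the retry budget is exhausted.
func waitForElasticsearch(endpoint string, retries int) error {
	for i := 0; i < retries; i++ {
		resp, err := http.Get(endpoint + "/_cluster/health")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("elasticsearch at %s did not become healthy", endpoint)
}

func main() {
	if err := waitForElasticsearch("http://localhost:9200", 30); err != nil {
		panic(err)
	}
	fmt.Println("elasticsearch is up")
}
```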
Example POST body for an Elasticsearch crawl:

```json
{
    "index": "demo",
    "url": "http://www.example.com",
    "type": "elasticsearch"
}
```

Example POST body for an App Search crawl:

```json
{
    "engine": "demo",
    "url": "http://www.example.com",
    "type": "app-search"
}
```

Example response:

```json
{
    "status": 201,
    "url": "http://www.example.com"
}
```
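The same request/response exchange can be exercised from Go as well as curl; a minimal sketch using only the standard library (the struct names are illustrative, not the project's types):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// CrawlRequest matches the example POST bodies above; set Index for an
// Elasticsearch crawl or Engine for an App Search crawl.
type CrawlRequest struct {
	Index  string `json:"index,omitempty"`
	Engine string `json:"engine,omitempty"`
	URL    string `json:"url"`
	Type   string `json:"type"`
}

// CrawlResponse matches the example response above.
type CrawlResponse struct {
	Status int    `json:"status"`
	URL    string `json:"url"`
}

func main() {
	body, _ := json.Marshal(CrawlRequest{
		Index: "demo",
		URL:   "http://www.example.com",
		Type:  "elasticsearch",
	})

	resp, err := http.Post("http://localhost:8081/crawl", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out CrawlResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("crawl accepted: %+v\n", out)
}
```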
- Adam Bemiller
- Adam provided most of the high-level project and server/routes framework for this project. Huge thanks to him!
MIT License