first commit
gintarasm committed Aug 4, 2023
0 parents commit 08bc3c9
Showing 61 changed files with 4,385 additions and 0 deletions.
35 changes: 35 additions & 0 deletions .github/workflows/gradle.yml
@@ -0,0 +1,35 @@
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.
# This workflow will build a Java project with Gradle and cache/restore any dependencies to improve the workflow execution time
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-java-with-gradle

name: Java CI with Gradle

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

permissions:
  contents: read

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up JDK 11
      uses: actions/setup-java@v3
      with:
        java-version: '11'
        distribution: 'temurin'
    - name: Build and test
      uses: gradle/gradle-build-action@v2.5.0
      with:
        arguments: build

4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.gradle
build
.idea
**/.DS_Store
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 vinted

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
137 changes: 137 additions & 0 deletions README.md
@@ -0,0 +1,137 @@
# Flink BigQuery Connector ![Build](https://github.com/vinted/flink-big-query-connector/actions/workflows/gradle.yml/badge.svg) [![](https://jitpack.io/v/com.vinted/flink-big-query-connector.svg)](https://jitpack.io/#com.vinted/flink-big-query-connector)


This project provides a BigQuery sink for Apache Flink that writes data with exactly-once or at-least-once delivery guarantees.

## Usage

There are builder classes to simplify constructing a BigQuery sink. The code snippet below shows an example of building a BigQuery sink in Java:

```java
var credentials = new JsonCredentialsProvider("key");

var clientProvider = new BigQueryProtoClientProvider(
    credentials,
    WriterSettings.newBuilder().build()
);

var bigQuerySink = BigQueryStreamSink
    .<String>newProto()
    .withClientProvider(clientProvider)
    .withDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .withRowValueSerializer(new NoOpRowSerializer<>())
    .build();
```

The sink takes in a batch of records. Batching happens outside the sink by opening a window. Batched records need to implement the BigQueryRecord interface.

```java
var trigger = BatchTrigger.<Record, GlobalWindow>builder()
    .withCount(100)
    .withTimeout(Duration.ofSeconds(1))
    .withSizeInMb(1)
    .withResetTimerOnNewRecord(true)
    .build();

var processor = new BigQueryStreamProcessor()
    .withDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
    .build();

source
    .keyBy(s -> s)
    .window(GlobalWindows.create())
    .trigger(trigger)
    .process(processor);
```
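A minimal sketch of a batched record type follows. The `BigQueryRecord` interface ships with the connector, so the stand-in declaration and its method name below are assumptions for illustration only; check the connector sources for the real contract.

```java
// Illustrative stand-in for the connector's BigQueryRecord contract;
// the real interface lives in the connector and its methods may differ.
interface BigQueryRecord {
    long getSize(); // approximate payload size, consulted by the size-based trigger
}

// A hypothetical event type carrying a pre-serialized payload.
class ClickEvent implements BigQueryRecord {
    private final byte[] payload;

    ClickEvent(byte[] payload) {
        this.payload = payload;
    }

    @Override
    public long getSize() {
        return payload.length;
    }
}
```

Reporting an accurate size matters here: the `withSizeInMb` trigger above fires based on the accumulated sizes these records report.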


To write to BigQuery, you need to:

* Define credentials
* Create a client provider
* Batch records
* Create a value serializer
* Sink to BigQuery

## Credentials

Credentials can be provided in two ways:

* Loading from a file
```java
new FileCredentialsProvider("/path/to/file")
```
* Passing as a JSON string
```java
new JsonCredentialsProvider("key")
```

## Types of Streams

BigQuery supports two data formats: JSON and proto. Choose the format by creating the matching client provider and using the corresponding builder method.

* JSON
```java
var clientProvider = new BigQueryJsonClientProvider(
    credentials,
    WriterSettings.newBuilder().build()
);

var bigQuerySink = BigQueryStreamSink
    .<String>newJson()
```
* Proto
```java
var clientProvider = new BigQueryProtoClientProvider(
    credentials,
    WriterSettings.newBuilder().build()
);

var bigQuerySink = BigQueryStreamSink
    .<String>newProto()
```

## Exactly once

Exactly-once delivery uses a [buffered stream](https://cloud.google.com/bigquery/docs/write-api#buffered_type), managed by the BigQueryStreamProcessor, to assign and process data batches. If a stream becomes inactive or is closed, a new stream is created automatically. The BigQuery sink writer appends data and flushes it up to the latest offset when a checkpoint is committed.
## At least once

Data is written to the [default stream](https://cloud.google.com/bigquery/docs/write-api#default_stream) and handled by the BigQueryStreamProcessor, which batches rows and sends them to the sink for processing.
## Serializers

For the proto stream, you need to implement `ProtoValueSerializer`, and for the JSON stream, you need to implement `JsonRowValueSerializer`.
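A sketch of what such a serializer might look like: the interface name follows the text above, but its exact signature is an assumption, and the stand-in declaration below exists only to make the example self-contained — the real interface comes from the connector.

```java
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for the connector's serializer contract;
// the real ProtoValueSerializer's signature may differ.
interface ProtoValueSerializer<T> {
    byte[] serialize(T value);
}

// A hypothetical event with two fields.
class OrderEvent {
    final String orderId;
    final long amountCents;

    OrderEvent(String orderId, long amountCents) {
        this.orderId = orderId;
        this.amountCents = amountCents;
    }
}

// In a real proto serializer you would build the protobuf message and
// return message.toByteArray(); a delimited string stands in here.
class OrderEventSerializer implements ProtoValueSerializer<OrderEvent> {
    @Override
    public byte[] serialize(OrderEvent event) {
        return (event.orderId + "|" + event.amountCents)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```

An instance of the serializer is what you would pass to `withRowValueSerializer(...)` when building the sink.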

## Metrics
<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 15%">Scope</th>
<th class="text-left" style="width: 18%">Metrics</th>
<th class="text-left" style="width: 39%">Description</th>
<th class="text-left" style="width: 10%">Type</th>
</tr>
</thead>
<tbody>
    <tr>
      <th rowspan="4">Stream</th>
      <td>stream_offset</td>
      <td>Current offset for the stream. When using at-least-once, the offset is always 0</td>
      <td>Gauge</td>
    </tr>
    <tr>
      <td>batch_count</td>
      <td>Number of records in the appended batch</td>
      <td>Gauge</td>
    </tr>
    <tr>
      <td>batch_size_mb</td>
      <td>Appended batch size in MB</td>
      <td>Gauge</td>
    </tr>
    <tr>
      <td>split_batch_count</td>
      <td>Number of times a batch hit the BigQuery request size limit and was split in two</td>
      <td>Gauge</td>
    </tr>
</tbody>
</table>

80 changes: 80 additions & 0 deletions build.gradle
@@ -0,0 +1,80 @@
plugins {
    id 'com.github.johnrengelman.shadow' version '7.1.2'
    id 'maven-publish'
    id 'java-library'
    id 'me.qoomon.git-versioning' version '4.3.0'
}

sourceCompatibility = JavaVersion.VERSION_11
targetCompatibility = JavaVersion.VERSION_11

repositories {
    mavenCentral()
    mavenLocal()
    maven { url 'https://packages.confluent.io/maven/' }
    maven { url 'https://nexus.vinted.net/repository/maven-proxy-oom/' }
}

group = "com.github.vinted"

gitVersioning.apply {
    tag {
        pattern = '(?<tagVersion>\\d+\\.\\d+\\.\\d+$)'
        versionFormat = '${tagVersion}'
    }
    branch {
        pattern = '.+'
        versionFormat = '${branch}-SNAPSHOT'
    }
    preferTags = false
}

ext {
    flinkVersion = '1.17.0'
    bigqueryVersion = '2.27.0'
    bigqueryStorageVersion = '2.37.2'
    json4sVersion = '4.0.3'
}

dependencies {
    // Flink provided dependencies
    compileOnly "org.apache.flink:flink-connector-base:$flinkVersion"
    compileOnly "org.apache.flink:flink-streaming-java:$flinkVersion"
    compileOnly "org.apache.flink:flink-core:$flinkVersion"

    implementation "com.google.cloud:google-cloud-bigquery:$bigqueryVersion"
    implementation "com.google.cloud:google-cloud-bigquerystorage:$bigqueryStorageVersion"

    testImplementation platform('org.junit:junit-bom:5.9.1')
    testRuntimeOnly "org.junit.jupiter:junit-jupiter-engine"
    testImplementation 'org.mockito:mockito-junit-jupiter:4.11.0'
    testImplementation "org.assertj:assertj-core:3.24.0"
    testImplementation "org.apache.flink:flink-connector-base:$flinkVersion"
    testImplementation "org.mockito:mockito-inline:4.7.0"
    testImplementation "org.apache.flink:flink-test-utils:$flinkVersion"
    testImplementation 'commons-io:commons-io:2.11.0'
}

test {
    useJUnitPlatform()
    testLogging {
        events "passed", "skipped", "failed"
    }
}

shadowJar {
    classifier = null
    relocate 'io.grpc', 'com.vinted.flink.bigquery.shaded.io.grpc'
    relocate 'io.netty', 'com.vinted.flink.bigquery.shaded.io.netty'
    mergeServiceFiles()
}

publishing {
    publications {
        java(MavenPublication) {
            artifactId = project.name
            artifact shadowJar
        }
    }
}

Binary file added gradle/wrapper/gradle-wrapper.jar
Binary file not shown.
5 changes: 5 additions & 0 deletions gradle/wrapper/gradle-wrapper.properties
@@ -0,0 +1,5 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.3.3-bin.zip
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
