layout | title | group | weight | section |
---|---|---|---|---|
default | Myria Python/Jupyter | docs | 4 | 2 |
Myria-Python is a Python interface to the Myria project, a distributed, shared-nothing big data management system and Cloud service from the University of Washington.
The Python components include intuitive, high-level interfaces for working with Myria, along with lower-level operations for interacting directly with the Myria API.
Developers interact with the Myria system using MyriaConnection instances to establish a connection to the database, MyriaQuery instances to issue queries and obtain results, and MyriaRelation instances to interact with stored data. Data may be uploaded in a variety of formats via a URL or the local file system. Downloaded data may be easily converted into Python dictionaries, pandas DataFrames, and NumPy arrays. A general workflow might involve the following high-level steps:
Myria-Python is also compatible with Jupyter (IPython) Notebooks. See the section below for examples.
The following example illustrates a subset of the functionality available in the Myria Python library:
from myria import *
## Establish a default connection to Myria
MyriaRelation.DefaultConnection = MyriaConnection(rest_url='http://demo.myria.cs.washington.edu:8753')
## Higher-level interaction via relation and query instances
query = MyriaQuery.submit(
"""books = load('https://raw.githubusercontent.com/uwescience/myria-python/master/ipnb%20examples/books.csv',
csv(schema(name:string, pages:int)));
longerBooks = [from books where pages > 300 emit name];
store(longerBooks, LongerBooks);""")
# Download relation and convert it to JSON
json = query.to_dict()
# ... or download to a pandas DataFrame
dataframe = query.to_dataframe()
# ... or download to a NumPy array (as_matrix() was removed in pandas 1.0)
array = query.to_dataframe().values
## Access an already-stored relation
relation = MyriaRelation(relation='LongerBooks')
print(len(relation))
## Lower-level interaction via the REST API
connection = MyriaConnection(rest_url='http://demo.myria.cs.washington.edu:8753')
datasets = connection.datasets()
Users can install the Python libraries using pip install myria-python. Developers should clone the repository and run python setup.py develop.
In this Python example, we load the smallTable data and run a count(*) query over it, storing the result in a relation called dataCount. To learn more about the Myria query language, check out the MyriaL page.
from myria import *
connection = MyriaConnection(rest_url='http://demo.myria.cs.washington.edu:8753')
query = MyriaQuery.submit("""
data = load('https://raw.githubusercontent.com/uwescience/myria/master/jsonQueries/getting_started/smallTable',
csv(schema(left:int, right:int)));
q = [from data emit count(*)];
store(q, dataCount);""", connection=connection)
print(query.to_dict())
In the previous example we downloaded the result of a query. We can also download data that has been stored as a relation:
from myria import *
connection = MyriaConnection(rest_url='http://demo.myria.cs.washington.edu:8753')
# Load some data and store it in Myria
query = MyriaQuery.submit("""
data = load('https://raw.githubusercontent.com/uwescience/myria/master/jsonQueries/getting_started/smallTable',
csv(schema(left:int, right:int)));
store(data, data);""", connection=connection)
# Now access previously-stored data
relation = MyriaRelation('data', connection=connection)
print(relation.to_dict()[:5])
from myria import *
name = {'userName': 'public', 'programName': 'adhoc', 'relationName': 'Books'}
schema = { "columnNames" : ["name", "pages"],
"columnTypes" : ["STRING_TYPE","LONG_TYPE"] }
data = """Brave New World,288
Nineteen Eighty-Four,376
We,256"""
connection = MyriaConnection(rest_url='http://demo.myria.cs.washington.edu:8753')
result = connection.upload_file(
    name, schema, data, delimiter=',', overwrite=True)
relation = MyriaRelation("Books", connection=connection)
print(relation.to_dict())
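The schema dictionary above is written by hand. As a rough sketch, a dictionary of the same shape could be derived from the CSV text itself. The helper name infer_schema is hypothetical and not part of myria-python, and it only distinguishes the two column types that appear in this document (STRING_TYPE and LONG_TYPE).

```python
import csv
import io


def infer_schema(csv_text, column_names, delimiter=','):
    """Build a schema dict shaped like the hand-written one above.

    Hypothetical helper, not part of myria-python. A column is typed
    LONG_TYPE only if every value in it parses as an integer; all other
    columns fall back to STRING_TYPE. Assumes every row has a value for
    every named column.
    """
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    column_types = []
    for index in range(len(column_names)):
        try:
            for row in rows:
                int(row[index])
            column_types.append("LONG_TYPE")
        except ValueError:
            column_types.append("STRING_TYPE")
    return {"columnNames": list(column_names), "columnTypes": column_types}


data = """Brave New World,288
Nineteen Eighty-Four,376
We,256"""

schema = infer_schema(data, ["name", "pages"])
print(schema)
```

The resulting dictionary could then be passed wherever the hand-written schema is used, though it is worth checking the inferred types against the data you actually expect.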
import sys
import random
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2
from myria import *
connection = MyriaConnection(rest_url='http://demo.myria.cs.washington.edu:8753')
# Download a sample file to our local filesystem
urlretrieve("https://raw.githubusercontent.com/uwescience/myria-python/master/ipnb%20examples/books.csv",
            "books.csv")
# Initialize a name and schema for the new relation
name = {'userName': 'public',
        'programName': 'adhoc',
        'relationName': 'Books' + str(random.randrange(sys.maxsize))}  # Name must be unique!
schema = {"columnNames": ["name", "pages"],
          "columnTypes": ["STRING_TYPE", "LONG_TYPE"]}
# Now upload that file to Myria
with open('books.csv') as f:
    connection.upload_fp(name, schema, f)
# Now access the new relation
relation = MyriaRelation(name, connection=connection)
print(relation.to_dict())
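As an alternative to the random integer suffix used above, the standard library's uuid module can generate the unique part of the name. This is only a sketch: the helper unique_relation_name is hypothetical and not part of myria-python, and the dictionary keys are simply those used in the example above.

```python
import uuid


def unique_relation_name(base, user='public', program='adhoc'):
    """Return a relation-name dict with a UUID-based suffix.

    Hypothetical helper, not part of myria-python. The hex form of a
    UUID contains only digits and the letters a-f, so the suffix is
    alphanumeric.
    """
    return {'userName': user,
            'programName': program,
            'relationName': base + uuid.uuid4().hex}


name = unique_relation_name('Books')
print(name['relationName'])
```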
In the example below, we upload a local CSV file to the Myria service from the command line. You can run it from your terminal (assuming you've set up myria-python):
wget https://raw.githubusercontent.com/uwescience/myria/master/jsonQueries/getting_started/smallTable
myria_upload --overwrite --hostname demo.myria.cs.washington.edu --port 8753 --no-ssl --relation smallTable smallTable
Myria can load a relation in parallel. Each worker reads its own partition of the file, so the partitions must already exist in S3 (create them first if necessary). In the example below, worker 1 reads the first part of the file (TwitterK-part1.csv) while worker 2 reads the last part of the file (TwitterK-part2.csv).
from myria import *
connection = MyriaConnection(rest_url='http://demo.myria.cs.washington.edu:8753')
schema = MyriaSchema({"columnTypes" : ["LONG_TYPE", "LONG_TYPE"], "columnNames" : ["follower", "followee"]})
relation = MyriaRelation('parallelLoad', connection=connection, schema=schema)
# A list of worker-URL pairs -- must be one for each worker
work = [(1, 'https://s3-us-west-2.amazonaws.com/uwdb/sampleData/TwitterK-part1.csv'),
(2, 'https://s3-us-west-2.amazonaws.com/uwdb/sampleData/TwitterK-part2.csv')]
# Upload the data (CSV is the default upload type)
query = MyriaQuery.parallel_import(relation=relation, work=work)
print(query.status)
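The parallel load above assumes the partitions already exist. A minimal sketch of one way to produce them from a single local CSV file, before uploading them to S3, might look like the following. The function name split_csv and the -partN.csv naming are assumptions chosen to mirror the file names above, and the sketch assumes one record per line with no header row and no quoted fields that span lines.

```python
import os
import tempfile


def split_csv(path, num_parts):
    """Split a headerless CSV file into num_parts files of whole lines.

    Hypothetical helper, not part of myria-python. Lines are dealt out
    in contiguous chunks, so part 1 holds the start of the file and the
    last part holds the end. Returns the list of partition paths.
    """
    with open(path) as source:
        lines = source.readlines()

    stem, _ = os.path.splitext(path)
    # Ceiling division so that every line lands in some partition
    chunk = max(1, -(-len(lines) // num_parts))
    part_paths = []
    for part in range(num_parts):
        part_path = '%s-part%d.csv' % (stem, part + 1)
        with open(part_path, 'w') as target:
            target.writelines(lines[part * chunk:(part + 1) * chunk])
        part_paths.append(part_path)
    return part_paths


# Demonstration on a small made-up file in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'TwitterK.csv')
with open(demo_path, 'w') as f:
    f.write('1,2\n1,3\n2,3\n3,1\n4,1\n')

parts = split_csv(demo_path, 2)
print(parts)
```

Each resulting file would still need to be uploaded to a location the workers can reach, one URL per worker.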
Myria exposes convenience functionality when running within the Jupyter/IPython environment. See our sample IPython notebook for a live demo.
%load_ext myria
%connect http://demo.myria.cs.washington.edu:8753
%%query
books = load('https://raw.githubusercontent.com/uwescience/myria-python/master/ipnb%20examples/books.csv',
csv(schema(name:string, pages:int)));
longerBooks = [from books where pages > 300 emit name];
store(longerBooks, LongerBooks);
You can embed local Python variables into a query expression. For example, assume we have set the following local variables:
low, high, name = 300, 1000, 'MyBooks'
Now we can execute a query in an IPython notebook that binds over our local environment:
%%query
books = load('https://raw.githubusercontent.com/uwescience/myria-python/master/ipnb%20examples/books.csv',
csv(schema(name:string, pages:int)));
longerBooks = [from books where pages > @low and pages < @high emit name];
store(longerBooks, @name);
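To make the binding concrete, here is a small sketch of how @-prefixed names could be replaced with values from a dictionary of variables. This is an illustration only: it is not the extension's actual implementation, and the function name bind_variables is hypothetical.

```python
import re


def bind_variables(query, variables):
    """Replace each @name token in query with str(variables[name]).

    Hypothetical illustration, not the Myria extension's code. Tokens
    whose name is not in variables are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        if key in variables:
            return str(variables[key])
        return match.group(0)

    return re.sub(r'@([A-Za-z_][A-Za-z0-9_]*)', substitute, query)


low, high, name = 300, 1000, 'MyBooks'
template = ("longerBooks = [from books where pages > @low "
            "and pages < @high emit name];\n"
            "store(longerBooks, @name);")
bound = bind_variables(template, {'low': low, 'high': high, 'name': name})
print(bound)
```

Only tokens that start with @ are touched, so the column called name in the emit clause is left alone while @name becomes MyBooks.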
The above examples use MyriaL. For more information, please see http://myria.cs.washington.edu/docs/myrial.html.