Metadata-Version: 2.1
Name: meteo-spark
Version: 0.0.3
Summary: A python package to process climate scientific files using pyspark.
Home-page: https://github.com/hussein-awala/MeteoSpark
Author: Hussein Awala
Author-email: hussein.awala.96@gmail.com
License: http://www.apache.org/licenses/LICENSE-2.0
Project-URL: Bug Tracker, https://github.com/hussein-awala/MeteoSpark/issues
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE

# meteo-spark

meteo-spark is an open source project that aims to simplify the Climate Data Analysis
using [PySpark](https://spark.apache.org/docs/latest/api/python/index.html),
which allow the processing of very big files saved in the cloud
([S3](https://docs.aws.amazon.com/s3), [GCS](https://cloud.google.com/storage/docs),
...) on a large pyspark cluster managed by [YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html)
or [Kubernetes](https://kubernetes.io/).

## Installation
Install and update using [pip](https://pip.pypa.io/en/stable/getting-started/):
```shell
pip install meteo-spark
```

## A Simple Example

```python
from pyspark import SparkContext, SparkConf
from meteo_spark import load_dataset

# create a new spark context
conf = SparkConf().setAppName("Meteo Spark app")
sc = SparkContext(conf=conf)

# use the method load_dataset to read the netcdf files
# and load them as a RDD partitioned by longitude and latitude with 10 slices
meteo_data = load_dataset(
    sc,
    paths="data/*.nc",
    num_partitions=10,
    partition_on=["longitude", "latitude"]
)
# calculate the max temperature for each point for the whole period
max_meteo_data = meteo_data.map(lambda x: x["t2m"].max())
# take the first element
max_meteo_data.take(1)
```

