Study notes
Azure Databricks
- Microsoft analytics service, part of the Microsoft Azure cloud platform (Databricks' Apache Spark-based analytics platform, natively integrated into Azure security and data services)
- It runs on top of a proprietary data processing engine called Databricks Runtime
- Offers a fast, easy, and collaborative Spark-based analytics service
- Key concepts:
  - Workspaces
    - Group objects (like notebooks, libraries, experiments) into folders
    - Provide access to your data
    - Provide access to the compute resources used (clusters, jobs)
  - Clusters
    - A set of compute resources on which you run your code
    - Before we can use a cluster, we have to choose one of the available runtimes:
      - Databricks Runtime: includes Apache Spark, plus components and updates that optimize the usability, performance, and security of big data analytics
      - Databricks Runtime for Machine Learning: a variant that adds multiple machine learning libraries such as TensorFlow, Keras, and PyTorch
      - Databricks Light: for jobs that don't need the advanced performance, reliability, or autoscaling of the Databricks Runtime
To access our data:
- Import our files to DBFS using the UI
- Upload a local file and import the data.
- Use data already existing under DBFS.
Once the data is uploaded, it will be available as a table or as a mount point under the DBFS file system (/FileStore).
- Mount and use supported data sources via DBFS
- Mount external data sources, like Azure Storage, Azure Data Lake and more.
- Read data on cluster nodes using Spark APIs
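For example, a minimal sketch of reading an imported CSV with the Spark API (the /FileStore/tables/nyc_taxi.csv path is a hypothetical example of a file imported through the UI):
# Read a CSV previously imported to DBFS (hypothetical path) into a DataFrame
df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('/FileStore/tables/nyc_taxi.csv')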
DBFS mounted data
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters; it is an abstraction on top of scalable object storage.
- Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
- Allows you to interact with object storage using directory and file semantics instead of storage URLs.
- Persists files to object storage, so you won’t lose data after you terminate a cluster.
With DBFS you can access:
- Local files (previously imported). For example, the tables you imported above are available under /FileStore
- Remote files, objects kept in external storage, accessed as if they were on the local file system
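A minimal sketch of mounting an Azure Blob Storage container through dbutils (the storage account, container, secret scope, and key names are all hypothetical):
dbutils.fs.mount(
    source='wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    mount_point='/mnt/mydata',
    extra_configs={
        # hypothetical secret scope/key holding the storage account access key
        'fs.azure.account.key.mystorageaccount.blob.core.windows.net':
            dbutils.secrets.get(scope='my-scope', key='storage-account-key')
    }
)
display(dbutils.fs.ls('/mnt/mydata'))  # mounted files behave like local paths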
Notebooks in Databricks
What is special here is that you can:
- choose the default language for the notebook's cells (Python, Scala, R, and SQL). You can override the default language by specifying the language magic command %<language> at the beginning of a cell; the supported magic commands are (see the example after this list):
- %python
- %r
- %scala
- %sql
- choose the cluster the notebook is attached to (where its cells will run)
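For example, in a notebook whose default language is Python, a single cell can be switched to SQL (querying the nyc_taxi_csv table used further below):
%sql
SELECT * FROM nyc_taxi_csv LIMIT 10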
Spark provides three different APIs: Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. In Azure Databricks, the most commonly used is the DataFrame.
DataFrames are distributed collections of data, organized into rows and columns. Each column in a DataFrame has a name and an associated type.
Load data into a DataFrame:
df = spark.sql("SELECT * FROM nyc_taxi_csv")
Other common statements (DataFrame API):
df = spark.read.format('json').load()
df.write.format('parquet').bucketBy(100, 'year', 'month').mode("overwrite").saveAsTable('table1')
df.select('*')
df.select(COLUMNS)
...
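A minimal sketch tying these together, creating a small DataFrame with named, typed columns and applying a few common operations (the column names and values are made up for illustration):
# Create a DataFrame with explicit column names and types (DDL-style schema string)
data = [(2024, 1, 150.0), (2024, 2, 98.5)]
df_example = spark.createDataFrame(data, schema='year INT, month INT, fare DOUBLE')

df_example.printSchema()                         # each column has a name and a type
df_example.select('year', 'fare').show()         # project specific columns
df_example.filter(df_example.fare > 100).show()  # filter rows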
Available statistics are:
- Count
- Mean
- Stddev
- Min
- Max
- Arbitrary approximate percentiles specified as a percentage (for example, 75%).
df.corr('COLUMN1', 'COLUMN2')
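The statistics above are the ones the DataFrame describe() and summary() methods can compute, and corr() returns the correlation between two numeric columns; a minimal sketch (reusing df from above):
df.describe().show()                                 # count, mean, stddev, min, max
df.summary('count', 'mean', 'stddev', '75%').show()  # summary() also accepts approximate percentiles such as '75%'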
Visualize data:
- show() - Spark built-in
- display() - Azure Databricks
- displayHTML() - Azure Databricks
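A minimal sketch of each option (reusing df from above; the HTML string is arbitrary):
df.show(5)                               # Spark built-in: plain-text table output
display(df)                              # Azure Databricks: interactive table/chart rendering
displayHTML('<h2>NYC taxi sample</h2>')  # Azure Databricks: render raw HTML in the notebook output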
Resources:
Get started with Azure Databricks - Training | Microsoft Learn
Work with data in Azure Databricks - Training | Microsoft Learn