- RDD, DataFrame, Dataset
- On-heap vs off-heap memory (lecture 4)
- Databricks cluster (lecture 5)
- In Databricks, we first need to launch a cluster and attach it to the notebook in order to execute code.
- Cluster: a cluster is the computing infrastructure in the Databricks environment. It is a set of computation resources and configurations.
- Cluster Types:
- All-Purpose/Interactive cluster: [mainly used for development]
- All-purpose clusters are used to execute and analyze data collaboratively using interactive notebooks.
- You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters for collaborative interactive analysis.
- Job cluster:
- Job clusters are used to run fast and robust automated jobs.
- These clusters are created automatically at the start of a job run and terminated at the end of the run.
- They are visible only while the job is running.
- Pool: [it holds idle, ready-to-use instances to save boot time]
- Databricks pools are created to reduce cluster start time and speed up autoscaling.
- Clusters are attached to pools, and each pool is configured with a set of idle, ready-to-use instances.
- So when we attach a cluster to a pool, there are always instances ready to use; when the job completes, the instances are returned to the pool.
- Cluster Modes (3 types)
- Standard: this mode is suitable for a single user; if no team collaboration is needed, we can go for this mode.
- High Concurrency: this mode is more suitable for collaboration; it provides fine-grained sharing for maximum resource utilization and minimum query latencies.
- Single Node: this mode runs the job only on the driver node; no worker nodes are provisioned.
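- To make the configuration ideas above concrete, here is a minimal sketch of cluster specs as they might be submitted to the Databricks Clusters REST API. This is not from the lecture: the runtime version, node type, and pool id are placeholders, and the single-node settings reflect a common Databricks convention, so treat the exact values as assumptions.
# Hypothetical cluster specs (placeholder values) for the Databricks Clusters REST API
standard_cluster = {
    "cluster_name": "dev-interactive",
    "spark_version": "11.3.x-scala2.12",   # placeholder runtime version
    "node_type_id": "i3.xlarge",            # placeholder worker type
    "num_workers": 2,
    "autotermination_minutes": 30,          # terminate after 30 idle minutes
}

# Attaching the cluster to a pool: instances come from the pool's idle capacity
pooled_cluster = {
    "cluster_name": "dev-from-pool",
    "spark_version": "11.3.x-scala2.12",
    "instance_pool_id": "pool-1234",        # placeholder pool id
    "num_workers": 2,
}

# Single Node mode: driver only, no worker nodes are provisioned
single_node_cluster = {
    "cluster_name": "single-node-dev",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}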
PySpark Interview Questions (lecture 6)
How to read files in Spark?
You are going to see how to read different file formats in PySpark. First, you need to create a SparkSession; then you will see how to read the CSV, JSON, Parquet, and ORC file formats.
syntax:
df = spark.read.format(file_type) \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load(file_location)
# file_type -> csv, parquet, orc, json, avro
# path
#   -> load(path1)            # single file
#   -> load([path1, path2])   # for multiple files we need to pass a list
#   -> load(folder)           # all files in a folder
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("readFileApp").getOrCreate()
# now you will see how to read the different file formats
# Reading a CSV file
df_csv = spark.read.format("csv").option("header", "true").load("/path/to/csvfile.csv")
# Reading a JSON file
df_json = spark.read.format("json").load("/path/to/jsonfile.json")
# Reading a Parquet file
df_parquet = spark.read.format("parquet").load("/path/to/parquetfile.parquet")
# Reading an ORC file
df_orc = spark.read.format("orc").load("/path/to/orcfile.orc")
# Display the DataFrames
display(df_csv)
display(df_json)
display(df_parquet)
display(df_orc)
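The syntax section above also lists Avro as a supported format; a minimal sketch is shown below. Note that in open-source Spark the Avro reader lives in the external spark-avro package, while Databricks runtimes bundle it; the path here is a placeholder.
# Reading an Avro file (requires the spark-avro package outside Databricks)
df_avro = spark.read.format("avro").load("/path/to/avrofile.avro")
display(df_avro)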
How to read all files from a folder/directory in PySpark?
folder_path = "/path/to/folder"
df = spark.read.format("csv").option("header", "true").load(folder_path)
display(df)
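As noted in the syntax section, load() also accepts a list of paths, so you can read a specific set of files rather than the whole folder. The file names below are placeholders.
# Reading a specific list of files by passing the paths as a list
file_paths = ["/path/to/folder/file1.csv", "/path/to/folder/file2.csv"]
df_multi = spark.read.format("csv").option("header", "true").load(file_paths)
display(df_multi)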
How to read the latest file from the folder?
# Define the folder path
folder_path = "/path/to/folder"
# List all files in the folder (dbutils is available in Databricks notebooks)
files = dbutils.fs.ls(folder_path)
# Get the latest file based on modification time
latest_file = max(files, key=lambda x: x.modificationTime).path
# Read the latest file into a DataFrame (options must be set before the read call)
df = spark.read.option("header", "true").csv(latest_file)
# Display the DataFrame
display(df)
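dbutils is only available inside Databricks. If you need the same behaviour in a plain PySpark environment against a local filesystem, a minimal sketch using the standard library could look like this (the folder path is a placeholder):
import glob
import os

# Pick the most recently modified CSV in a local folder
local_folder = "/path/to/folder"
latest_local_file = max(glob.glob(os.path.join(local_folder, "*.csv")), key=os.path.getmtime)
df_latest = spark.read.option("header", "true").csv(latest_local_file)
display(df_latest)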
How do you create your own schema?
There are two methods to do this; you will see both in the following code snippet.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Method 1: define the schema programmatically with StructType/StructField
schema_definition = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.read.format("csv").schema(schema_definition).option("header", True).load("file_path")
# Method 2: define the schema as a DDL-formatted string
schema_alternate = "id INTEGER, name STRING, age INTEGER"
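The DDL string from Method 2 can be passed to the same schema() call; a minimal sketch, reusing the placeholder file_path from above:
df = spark.read.format("csv").schema(schema_alternate).option("header", True).load("file_path")
df.printSchema()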
PySpark Filter Conditions (lecture 7)
In this section, we are going to cover the following filter conditions; a short sketch demonstrating each one follows the list.
filter is used to limit the result set or extract a subset of the master data.
- Single & Multiple Conditions
- Starts With
- Ends With
- Contains
- Like
- Null values
- isin
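Below is a minimal sketch of each condition on a hypothetical DataFrame df with columns name, age, and city; the column names and values are assumptions for illustration.
from pyspark.sql.functions import col

# Single & multiple conditions
df.filter(col("age") > 30)
df.filter((col("age") > 30) & (col("city") == "London"))

# Starts with / ends with / contains
df.filter(col("name").startswith("A"))
df.filter(col("name").endswith("son"))
df.filter(col("name").contains("ar"))

# Like (SQL-style pattern matching)
df.filter(col("name").like("A%"))

# Null values
df.filter(col("city").isNull())
df.filter(col("city").isNotNull())

# isin
df.filter(col("city").isin("London", "Paris"))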