-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathdata-science-lingo1.qmd
More file actions
61 lines (44 loc) · 2.31 KB
/
data-science-lingo1.qmd
File metadata and controls
61 lines (44 loc) · 2.31 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
---
title: "Vocabulary Lingo Challenge"
uses: quarto-dev/quarto-actions/publish@v2
with:
target: gh-pages
to: html
path: source-folder
format: #hugo-md
html:
self-contained: true
page-layout: full
title-block-banner: true
toc: true
toc-depth: 3
toc-location: body
number-sections: false
html-math-method: katex
code-copy: true
code-fold: true
code-line-numbers: true
code-summary: "Show the code"
code-overflow: wrap
code-tools:
source: false
toggle: true
caption: See code
execute:
warning: false
---
## Link to Repository
[View the GitHub Repository](https://github.com/daviethedeveloper/app_challenge_wi24)
## The Value of Databricks in Data Science
Databricks streamlines data science projects by integrating data handling, analysis, and prediction tools into a single, collaborative platform. This unified approach facilitates teamwork, allowing everyone to access and work on the same datasets simultaneously.
## PySpark vs. Pandas/Tidyverse
Choosing between PySpark and Pandas/Tidyverse depends on your project's scale and requirements:
- **PySpark:** Ideal for large-scale data processing across distributed systems. It's like using a heavy-duty truck for long-haul data transportation, capable of handling extensive datasets efficiently.
- **Pandas/Tidyverse:** Best suited for smaller, single-machine data tasks. This tool is like a bicycle for quick, local trips—perfect for data cleaning and analysis on a smaller scale.
In essence, PySpark is your go-to for big data and distributed computing, while Pandas or Tidyverse excels in smaller, more contained projects.
- **Handling Size with PySpark:**
<img src="Images/Size_mm.jpg" alt="Handling Large Datasets with PySpark" width="300" height="300"/>
- **Handling Size with Pandas/Tidyverse:**
<img src="Images/Size2_mm.jpg" alt="Handling Smaller Datasets with Pandas/Tidyverse" width="300" height="300"/>
## Docker Explained
Docker is like a shipping container for software. Just as containers revolutionized how goods are transported by making them easily movable and scalable, Docker does the same for software. It ensures that an application works seamlessly, regardless of the environment it's running in, by packaging the application and its dependencies into a single container.