Introduction to Spark
This chapter provides a high-level overview of what Apache Spark is. If you are already familiar with Apache Spark and its components, feel free to jump ahead to Chapter 2.
What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general-purpose.
On the speed side, Spark extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed matters when processing large datasets: it is the difference between exploring data interactively and waiting minutes or hours for results. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine the different types of processing that are often needed in data analysis pipelines. It also reduces the operational burden of maintaining separate tools.
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
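To give a feel for this accessibility, here is a minimal sketch of the Python API. It assumes PySpark is installed locally, and the file name "README.md" is purely illustrative; it is not part of the original text.

# Minimal PySpark sketch: count the lines mentioning "Spark" in a local text file.
# Assumes a local PySpark installation; "README.md" is an illustrative file name.
from pyspark.sql import SparkSession

# Create the entry point to Spark; "local[*]" uses all local CPU cores.
spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

lines = spark.read.text("README.md")                  # one row per line of text
spark_lines = lines.filter(lines.value.contains("Spark"))
print(spark_lines.count())

spark.stop()

The same few lines could be written in Scala, Java, or expressed as a SQL query; the point is that a complete distributed computation fits in a short script.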
A Unified Stack
The Spark project contains multiple closely integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to interoperate closely, letting you combine them like libraries in a software project.
A philosophy of tight integration has several benefits. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. For example, when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one. These costs include deployment, maintenance, testing, support, and others. This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to upgrading Spark.
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured logfiles). In addition, more sophisticated data engineers and data scientists can access the same data via the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the while, the IT team has to maintain only one system.
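As a rough illustration of how these pieces combine in a single program, the following hedged sketch mixes a Spark SQL query with an MLlib feature transformer over the same data. The file name "events.json" and its columns ("level", "message") are hypothetical examples, not taken from the text.

# Hedged sketch: one application that queries data with Spark SQL and then
# feeds the result into an MLlib transformer. File and column names are
# hypothetical, chosen only for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("combined").getOrCreate()

# Structured query over ingested data, expressed in SQL.
events = spark.read.json("events.json")               # hypothetical input file
events.createOrReplaceTempView("events")
errors = spark.sql("SELECT message FROM events WHERE level = 'ERROR'")

# The same DataFrame flows directly into a machine learning feature stage.
words = Tokenizer(inputCol="message", outputCol="words").transform(errors)
words.show()

spark.stop()

Because both steps run on the same engine and share the same data representation, no export, reload, or glue code is needed between the SQL and machine learning stages.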
Here we will briefly introduce each of Spark’s components, shown in Figure 1-1.