Commit ab75cc92 authored by DANIEL DIAZ SANCHEZ's avatar DANIEL DIAZ SANCHEZ
# 3-spark-batch
# Big Data: batch processing with Spark
### Prerequisites
* Eclipse (available in the lab)
### Repository
The repository contains the Java source files as well as an input text (an excerpt from Don Quixote to test with).
## WordCount with Spark
Spark, like Hadoop, runs on several compute nodes and can use different storage backends (such as HDFS), although it is also possible to run Spark locally, on a single node and using the machine's own storage.
For the input file paths we will use different kinds of URLs:
* `file:///home/user/mifichero_entrada.txt` for the local case
* `hdfs://namenode:port/path` for files stored in HDFS
* `path` for paths relative to the installation directory (no scheme)
To use the local node, when we configure the Spark context we will use `local[numberOfNodes]`.
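Independently of Spark, what distinguishes the three path forms is the URI scheme, which tells Spark (through Hadoop's filesystem layer) where to read from. A quick way to see this with the standard `java.net.URI` class (illustrative only, not part of the lab code; the port `9000` below is just a typical HDFS example):

```java
import java.net.URI;

public class SchemeDemo {
    public static void main(String[] args) {
        // The scheme component selects the storage backend.
        URI local = URI.create("file:///home/user/mifichero_entrada.txt");
        URI hdfs = URI.create("hdfs://namenode:9000/path");
        System.out.println(local.getScheme()); // file
        System.out.println(hdfs.getScheme());  // hdfs
        // A bare relative path has no scheme at all.
        System.out.println(URI.create("path").getScheme()); // null
    }
}
```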
### Project and dependencies
* Create a Java project in Eclipse and convert it into a Maven project.
* Add the following dependencies:
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.2</version>
  </dependency>
</dependencies>
```
* Create a class called `JavaWordCount` with this code:
```java
package cdist;

import java.util.Arrays;
import java.util.Iterator;

import scala.Tuple2;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

/* This example has been taken from O'Reilly examples */
public class JavaWordCount {
    public static void main(String[] args) throws Exception {
        // Set the default input and output files.
        // It can be a file from HDFS -> hdfs://server:port/path
        // or a local file file:///path
        // or a relative local file "fileName" under the working directory (e.g. out)
        String inputFile = "file:///var/home/lab/asig/labgcd/workspace-cdist-spark-and-streaming/spark-aptel/in.txt";
        String outputFile = "out";
        // Let the user optionally pass parameters to define the input and output files
        if (args.length >= 2) {
            inputFile = args[0];
            outputFile = args[1];
        }
        // Create a Java Spark context for the application named "wordcount", using a local cluster
        // (to use an existing cluster, substitute "local" with its master URL)
        JavaSparkContext sc = new JavaSparkContext(
                "local", "wordcount", System.getenv("SPARK_HOME"), System.getenv("JARS"));
        // Load our input data.
        // This creates an immutable distributed set (RDD) of strings, one per line.
        JavaRDD<String> input = sc.textFile(inputFile);
        // Split up into words:
        // map each line to its words and flatten the result
        // (a single sequence of words, regardless of which line they came from)
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String x) {
                return Arrays.asList(x.split(" ")).iterator();
            }
        });
        // Transform into (word, count) pairs:
        // associate a 1 with each word,
        // then reduce by adding up all the numbers per word (the key)
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2<String, Integer>(x, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });
        // Save the word counts back out to a text file, triggering evaluation.
        counts.saveAsTextFile(outputFile);
    }
}
```
Analyze the code and try it out.
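To reason about what each transformation computes, it may help to mimic the pipeline with plain Java streams on a small in-memory list. This is a local sketch, not Spark code; the class name `LocalWordCount` is just for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    // Same three steps as the Spark job: flatMap, map each word to a 1, reduce by key.
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))          // line -> words
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum)); // (word, 1) summed per key
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Arrays.asList(
                "en un lugar de la Mancha",
                "de la Mancha"));
        System.out.println(counts.get("Mancha")); // 2
    }
}
```

The difference from the Spark version is that here everything runs on one list in one JVM, while Spark applies the same logic partition by partition across the cluster.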
### Lambda notation
Java 8 supports lambda notation (`->`), which makes the code easier to write and to read. This same class can be programmed with lambdas. Try it.
* Create a class `JavaWordCountDelta` with the following code:
```java
package cdist;

import java.util.Arrays;

import scala.Tuple2;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

/* This example has been taken from O'Reilly examples */
public class JavaWordCountDelta {
    public static void main(String[] args) throws Exception {
        // Default input and output files (see JavaWordCount for the possible URL forms)
        String inputFile = "file:///var/home/lab/asig/labgcd/workspace-cdist-spark-and-streaming/spark-aptel/in.txt";
        String outputFile = "out";
        // Let the user optionally pass parameters to define the input and output files
        if (args.length >= 2) {
            inputFile = args[0];
            outputFile = args[1];
        }
        // Create a Java Spark context, using a local cluster
        // (to use an existing cluster, substitute "local" with its master URL)
        JavaSparkContext sc = new JavaSparkContext(
                "local", "wordcount", System.getenv("SPARK_HOME"), System.getenv("JARS"));
        // Load our input data: an immutable RDD of strings, one per line.
        JavaRDD<String> input = sc.textFile(inputFile);
        // Split each line into words and flatten;
        // the lambda replaces the anonymous FlatMapFunction of the previous version.
        JavaRDD<String> words = input.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
        // Associate a 1 with each word, then sum the 1s per word (the key);
        // the lambdas replace the anonymous PairFunction and Function2.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(x -> new Tuple2<String, Integer>(x, 1))
                .reduceByKey((x, y) -> x + y);
        // Save the word counts back out to a text file, triggering evaluation.
        counts.saveAsTextFile(outputFile);
    }
}
```
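To see the syntax change in isolation, the same anonymous-class-versus-lambda contrast can be reproduced with the standard `java.util.function.BinaryOperator` interface instead of Spark's `Function2` (purely for comparison, not part of the lab code):

```java
import java.util.function.BinaryOperator;

public class LambdaDemo {
    public static void main(String[] args) {
        // Pre-Java-8 style: an anonymous inner class implementing the interface
        BinaryOperator<Integer> addAnon = new BinaryOperator<Integer>() {
            @Override
            public Integer apply(Integer x, Integer y) {
                return x + y;
            }
        };
        // Java 8 style: a lambda with the same behavior
        BinaryOperator<Integer> addLambda = (x, y) -> x + y;
        System.out.println(addAnon.apply(2, 3));   // 5
        System.out.println(addLambda.apply(2, 3)); // 5
    }
}
```

Any interface with a single abstract method (a functional interface) can be written either way, which is why every anonymous class in `JavaWordCount` collapses to a one-line lambda in `JavaWordCountDelta`.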