Why pack unpack and not toList[]
oleksii iepishkin edited this page Feb 6, 2014 · 8 revisions

The fields-based API's toList should not be used when the lists built inside a groupBy can be very large or when their size is not known in advance. toList does not decrease the data size significantly, and it stands a good chance of causing out-of-memory (OOM) errors if the lists get too long. A good alternative is to combine pack/unpack with reduce: use pack to convert each tuple into an object, then do a groupBy with a reduce function inside it that holds your logic for processing and combining the grouped items. Afterwards, unpack the combined object back into fields.
Example 1:
// toList collects every 'lastname of a group into one in-memory list.
val res_pipe = inputpipe.groupBy('firstname) {
  _.toList[String]('lastname -> 'lastname)
}
Example 2:
// Inside a com.twitter.scalding.Job; inputpipe carries 'firstname and 'lastname.
case class Person(firstname: String = "", lastname: String = "")

val res_pipe = inputpipe
  // Pack the two fields into a single Person object (map, not flatMap:
  // the function returns one value per tuple, not a collection).
  .map(('firstname, 'lastname) -> 'person) {
    in: (String, String) =>
      val (firstname, lastname) = in
      Person(firstname = firstname, lastname = lastname)
  }
  .groupBy('firstname) {
    // Combine the group's Person objects pairwise instead of building a list.
    _.reduce('person -> 'combinedperson) {
      (personAccumulated: Person, person: Person) =>
        Person(
          firstname = personAccumulated.firstname,
          // comma separated last names
          lastname = personAccumulated.lastname + "," + person.lastname
        )
    }
  }
  // Keep only the combined object so that unpack does not emit a second 'firstname.
  .project('combinedperson)
  .unpack[Person]('combinedperson -> ('firstname, 'lastname))
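
The difference between the two examples can also be sketched with plain Scala collections, without any Scalding dependency. This is only an illustration of the idea and does not show how Scalding executes either job: the object name PackReduceSketch, the helper names viaToList, viaReduce and combine, and the sample rows are all assumptions made for this sketch. viaToList first gathers every last name of a key into a collection, while viaReduce keeps exactly one Person per key and merges each new row into it, which is the shape of the reduce step in Example 2.

```scala
// Plain Scala collections stand in for a Scalding pipe here (illustration only).
case class Person(firstname: String = "", lastname: String = "")

object PackReduceSketch {
  // Same combining logic as the reduce function in Example 2.
  def combine(acc: Person, next: Person): Person =
    Person(firstname = acc.firstname, lastname = acc.lastname + "," + next.lastname)

  // toList style: collect all last names of a key first, then join them.
  def viaToList(rows: Seq[(String, String)]): Map[String, String] =
    rows.groupBy(_._1).map { case (first, group) =>
      first -> group.map(_._2).mkString(",")
    }

  // reduce style: hold one Person per key and merge each row into it as it arrives.
  def viaReduce(rows: Seq[(String, String)]): Map[String, Person] =
    rows.foldLeft(Map.empty[String, Person]) { case (acc, (first, last)) =>
      val person = Person(first, last) // "pack" the row into an object
      val merged = acc.get(first).map(combine(_, person)).getOrElse(person)
      acc.updated(first, merged)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows of (firstname, lastname).
    val rows = Seq(("Ann", "Lee"), ("Bob", "Ray"), ("Ann", "Kim"))
    println(viaToList(rows))
    // "unpack" the combined objects back into (firstname, lastname) pairs
    println(viaReduce(rows).map { case (first, p) => first -> p.lastname })
  }
}
```

Both functions produce the same comma-separated last names per first name. Note that in this particular example the combined string still grows with the size of the group; the reduce form pays off most when the combining function yields a compact result such as a count, a sum, or a bounded summary.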