El Rincon - Marcelo's Corner: 2014

Monday, December 29, 2014

BDD in Action - Book Review

I finished reading Manning's BDD in Action (behavior-driven development) by John Ferguson Smart, which I found very insightful. Four things struck me about this book:

Don't write unit tests, write low-level specifications
Favor outside-in development
Learned about Spock framework
There is a difference between a story and a feature

I used RSpec before, but without a clear understanding of BDD, so I wrote unit tests (test scripts) rather than low-level specifications. The book explains why BDD is important, with detailed steps and examples. In BDD we write behaviors and specifications that then drive the software. One of the key goals of BDD is to ensure that everyone has a clear understanding of what the project is trying to deliver, and of the underlying business objective. BDD is TDD with better guidelines, or even a totally new approach to development. This is why wording and semantics are important: the tests need to clearly explain the business behavior they're demonstrating. BDD encourages people to write tests in terms of the expected behavior of the program in a given set of circumstances.

When writing user stories we were told to use this template:
As a <stakeholder> I want <something> so that I can <achieve some business goal>.

Once you have a story, then you have to explore the details by asking the users and other stakeholders for concrete examples.

In BDD the following notation is often used to express examples:
Given <a context>: describes the preconditions for the scenario and prepares the test environment.
When <something happens>: describes the action under test.
Then <you expect some outcome>: describes the expected outcome.

Example:
Story: Returns go to stock

In order to keep track of stock
As a store owner
I want to add items back to stock when they're returned

Scenario 1: Refunded items should be returned to stock
Given a customer previously bought a black sweater from me
And I currently have three black sweaters left in stock
When he returns the sweater for a refund
Then I should have four black sweaters in stock

Scenario 2: Replaced items should be returned to stock
Given that a customer buys a blue garment
And I have two blue garments in stock
And three black garments in stock.
When he returns the garment for a replacement in black,
Then I should have three blue garments in stock
And two black garments in stock
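
As a rough sketch of how Scenario 1 could become an executable low-level specification, here is a plain Java version with no test framework. The Stock class and its methods are hypothetical names invented for this illustration; they do not come from the book.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical domain class, invented for this illustration.
class Stock {
    private final Map<String, Integer> items = new HashMap<>();

    void setQuantity(String item, int quantity) {
        items.put(item, quantity);
    }

    // A refunded item goes back on the shelf.
    void refund(String item) {
        items.merge(item, 1, Integer::sum);
    }

    int quantityOf(String item) {
        return items.getOrDefault(item, 0);
    }
}

class RefundedItemsSpec {
    // The method name states the behavior; the comments mirror the scenario.
    static int refundedItemsShouldBeReturnedToStock() {
        // Given I currently have three black sweaters left in stock
        Stock stock = new Stock();
        stock.setQuantity("black sweater", 3);

        // When a customer returns a black sweater for a refund
        stock.refund("black sweater");

        // Then I should have four black sweaters in stock
        return stock.quantityOf("black sweater");
    }

    public static void main(String[] args) {
        int inStock = refundedItemsShouldBeReturnedToStock();
        if (inStock != 4) {
            throw new AssertionError("expected 4 black sweaters, got " + inStock);
        }
        System.out.println("Refunded items are returned to stock");
    }
}
```

The point of the sketch is the naming: the method reads like the scenario title, and each Given/When/Then step maps to one small block of code.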

As the book Specification by Example mentions, instead of waiting for specifications to be expressed precisely for the first time in the implementation, successful teams illustrate specifications using examples. The team works with the business users or domain experts to identify key examples that describe the functionality. During this process, developers and testers often suggest additional examples that illustrate the edge cases or address areas of the system that are particularly problematic. This flushes out functional gaps and inconsistencies and ensures that everyone involved has a shared understanding of what needs to be delivered, avoiding the rework that results from misinterpretation and translation.

Besides explaining the difference between unit tests and specifications, the book also talks about the difference between features and user stories. They are NOT the same. A feature is a piece of functionality that you deliver to the end users or to other stakeholders to support a capability that they need in order to achieve their business goals. A user story is a planning tool that helps you flesh out the details of what you need to deliver for a particular feature. You can have features without having stories. As a matter of fact, a good practice is to summarize the "Given" and "When" sections of the scenario in the title and avoid including any expected outcomes. Because scenarios are based on real business examples, the context and events are usually stable, but the expected outcome may change as the organization changes and evolves the way it does business.

Besides the language syntax, I discovered the Spock framework. It lets you write concise and descriptive tests with less boilerplate code than would be needed using Java. The syntax encourages you to write tests in terms of your expectations of the program's behavior in a given set of circumstances.

Example:

While I was reading this book, two quotes came to my head:

Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live - M. Golding.

Programs must be written for people to read, and only incidentally for machines to execute - Abelson and Sussman.

The other insightful thing that I learned is that BDD favors an outside-in development approach, which includes these steps:

Start with a high-level acceptance criterion that you want to implement
Automate the acceptance criterion as pending scenarios, breaking the acceptance criterion into smaller steps
Implement the acceptance criterion step definition, imagining the code you'd like to have to make each step work
Use these step definitions to flesh out unit tests that specify how the application code will behave
Implement the application code, and refactor as required

There are many benefits of outside-in development, but the principal motivations are summarized here:

Outside-in code focuses on business value
Outside-in code encourages well-designed, easy to understand code
Outside-in code avoids waste

As I mentioned, I enjoyed the book and I found it very insightful. Some (if not all) of the ideas in the book have been around for decades. I believe that this book is great for architects, programmers, testers, project managers, product owners, and scrum masters.

Monday, November 10, 2014

Validate Map Collections via Matchers

I was introduced to Hamcrest Matchers by the 3C team, and I really like them. Today, I stumbled on a validation of a Map. Here is how I usually solved the problem and then how I solved it using the syntactic sugar of Matchers.

Sunday, November 9, 2014

Lambdas and Java 8

Java 1.8 introduces the concept of streams, which are similar to iterators.

Why Lambdas are good for you:

Form the basis of functional programming languages
Make parallel programming easier
Write more compact code
Richer data structure collections
Develop cleaner APIs

Lambda Expression Lifecycle - think of lambdas as having a two-stage lifecycle:

Convert the lambda expression to a function.
Call the generated function.
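
The two stages can be seen in a few lines of Java. This is only a minimal sketch; the class and method names are made up for the example.

```java
import java.util.function.Function;

class LambdaLifecycle {
    static int squareOf(int value) {
        // Stage 1: the lambda expression is converted to a function object.
        Function<Integer, Integer> square = x -> x * x;
        // Stage 2: the function is called through its interface method.
        return square.apply(value);
    }

    public static void main(String[] args) {
        System.out.println(squareOf(5)); // prints 25
    }
}
```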

Streams have two types of operations: intermediate and terminal.
Intermediate operation: specifies a task to perform on the stream's elements and always results in a new stream.
filter: Results in a stream containing only the elements that satisfy a condition.
distinct: Results in a stream containing only the unique elements.
limit: Results in a stream with the specified number of elements from the beginning of the original stream.
map: Results in a stream in which each element of the original stream is mapped to a new value (possibly of a different type).
sorted: Results in a stream in which the elements are in sorted order. The new stream has the same number of elements as the original stream.
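
Here is a small sketch that chains the intermediate operations listed above. The class name, method name, and sample data are invented for the illustration; nothing runs until the terminal collect call at the end.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class IntermediateOps {
    static List<String> firstThreeLongNames(List<String> names) {
        return names.stream()
                .filter(name -> name.length() > 3) // keep names longer than 3 characters
                .distinct()                        // drop duplicates
                .map(String::toUpperCase)          // transform each element
                .sorted()                          // natural (alphabetical) order
                .limit(3)                          // keep at most the first three
                .collect(Collectors.toList());     // terminal operation: triggers the pipeline
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Marcelo", "Ana", "Tyler", "Marcelo", "Kurt", "Russell");
        System.out.println(firstThreeLongNames(names)); // prints [KURT, MARCELO, RUSSELL]
    }
}
```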

Terminal operation: initiates processing of a stream pipeline's intermediate operations and produces a result.
forEach: Performs processing on every element in a stream.
average: Calculates the average of the elements in a numeric stream.
count: Returns the number of elements in the stream.
max: Locates the largest value in a numeric stream.
min: Locates the smallest value in a numeric stream.
reduce: Reduces the elements of a collection to a single value using an associative accumulation function (e.g. a lambda that adds two elements; Scala offers the same idea as reduce and fold).
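
The terminal operations above can be tried on a small numeric stream. This is a sketch with made-up names and data; note that each method builds a fresh stream, because a stream cannot be reused after a terminal operation.

```java
import java.util.stream.IntStream;

class TerminalOps {
    static final int[] VALUES = {3, 10, 6, 1, 4};

    static double average() {
        return IntStream.of(VALUES).average().orElse(0.0);
    }

    static long count() {
        return IntStream.of(VALUES).count();
    }

    static int max() {
        return IntStream.of(VALUES).max().getAsInt();
    }

    static int min() {
        return IntStream.of(VALUES).min().getAsInt();
    }

    // reduce with an identity of 0 and a lambda that adds two elements
    static int sum() {
        return IntStream.of(VALUES).reduce(0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        // forEach performs an action on every element
        IntStream.of(VALUES).forEach(value -> System.out.println("value: " + value));
        System.out.println("sum: " + sum()); // prints sum: 24
    }
}
```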

Mutable reduction operations: create a container (such as a collection or StringBuilder)
collect: Creates a new collection of elements containing the results of the stream's prior operations.
toArray: Creates an array containing the results of the stream's prior operations.
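
A short sketch of collect and toArray follows. The class and method names are invented for the example; Collectors.toList and Collectors.joining are standard collectors from java.util.stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MutableReduction {
    // collect gathers the surviving elements into a new List
    static List<Integer> evens(List<Integer> numbers) {
        return numbers.stream()
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());
    }

    // collect can also accumulate the elements into a single String
    static String joined(List<String> words) {
        return words.stream().collect(Collectors.joining(", "));
    }

    // toArray copies the stream's elements into a typed array
    static Integer[] asArray(List<Integer> numbers) {
        return numbers.stream().toArray(Integer[]::new);
    }

    public static void main(String[] args) {
        System.out.println(evens(Arrays.asList(1, 2, 3, 4)));            // prints [2, 4]
        System.out.println(joined(Arrays.asList("given", "when", "then"))); // prints given, when, then
        System.out.println(asArray(Arrays.asList(7, 8, 9)).length);      // prints 3
    }
}
```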

Search operations
findFirst: Find the first stream element based on the prior intermediate operations; immediately terminates the processing of the stream pipeline once such an element is found.
findAny: Finds any stream element based on the prior intermediate operations; immediately terminates processing of the stream pipeline once such an element is found.
anyMatch: Determines whether any stream elements match a specified condition; immediately terminates processing of the stream pipeline if an element matches.
allMatch: Determines whether all of the elements in the stream match a specified condition.
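
The search operations can be sketched like this. The names and sample numbers are made up; findFirst returns an Optional because the stream may contain no matching element.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class SearchOps {
    // findFirst stops the pipeline as soon as one element passes the filter
    static Optional<Integer> firstGreaterThan(List<Integer> numbers, int threshold) {
        return numbers.stream().filter(n -> n > threshold).findFirst();
    }

    // anyMatch short-circuits on the first matching element
    static boolean hasNegative(List<Integer> numbers) {
        return numbers.stream().anyMatch(n -> n < 0);
    }

    // allMatch is true only if every element satisfies the condition
    static boolean allPositive(List<Integer> numbers) {
        return numbers.stream().allMatch(n -> n > 0);
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(2, 7, 4, 9);
        System.out.println(firstGreaterThan(numbers, 5)); // prints Optional[7]
        System.out.println(hasNegative(numbers));         // prints false
        System.out.println(allPositive(numbers));         // prints true
    }
}
```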

Examples:

Refactor:

Java 8 has changed a lot of my code. Here are some before-and-after examples.

Before: I wanted to print out some of the results, so I leveraged Spring's CommandLineRunner.

Here's the after code:

Another example is fetching a collection of records from a database. I refactored it like this:

Thursday, October 16, 2014

Changing Java SDK in IntelliJ IDEA 13

We just migrated to Java 1.8. On my personal computer, I installed JDK 1.8 and made sure that Maven was running fine with the latest Java version. I'm using a Mac, and when I ran it in Terminal everything worked. However, when I ran it in IntelliJ IDEA, it said that it was running Java 1.7.

To change it, go to the following menu: File, Project Structure, then click on Project.
Here is where you can set the SDK for your project. Just change it to the right SDK and that's it.

Happy coding.

Wednesday, September 3, 2014

Learning Cassandra

I just finished reading Practical Cassandra. I enjoyed this book and it helped with my presentation at Rokk3r Labs in Miami Beach. You can tell that Russell Bradberry and Eric Lubow spent some time thinking about this book. I like that it's straight to the point for a developer, but it is also useful for sysadmins and managers. I enjoyed the troubleshooting and "use cases" sections.

The book discusses "where Cassandra fits in". This is a question that I constantly get when talking about Cassandra. Many people want to know, "why not [NoSQL database of your choice]?". The short answer is: if you want fast writes, multi-data center support baked into your system, and a truly scalable system with tons of metrics, then you should consider Cassandra. However, I always follow my answer by saying that the best way to know whether Cassandra fits the role is to understand it. When I started using it, I had to stop myself from thinking about everything I knew about data modeling with an RDBMS. Most of what we learned with RDBMSs is actually an anti-pattern in Cassandra: normalization, building your model first, indexing high-cardinality columns, and leveraging joins. Don't think of a relational table; think of a nested, sorted map data structure.

Tunable Consistency and Polyglot Databases

Many people don't realize that you can tune Cassandra's consistency. The following consistency levels are available for reads and writes:

ANY: for writes only; ensures that the write will persist on any server in the cluster.
ONE: ensures that at least one server within the replica set will persist the write or respond to the read.
QUORUM: the read/write will go to half of the nodes in the replica set plus one.
LOCAL_QUORUM: just like QUORUM, except that it applies only to the nodes in the local data center.
EACH_QUORUM: like QUORUM, but ensures a quorum read/write in each of the data centers.
ALL: ensures that all nodes in a replica set will receive the read/write.
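
To make the arithmetic concrete, here is a small sketch of the quorum calculation. The class and method names are hypothetical. The overlap check (read replicas + write replicas > replication factor) is a commonly cited rule of thumb for when a read is guaranteed to see the latest write; it is my addition and not something taken from the book.

```java
class ConsistencyMath {
    // QUORUM = half of the replicas (rounded down) plus one.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Rule of thumb (an assumption, see above): a read sees the latest write
    // when the replicas read and the replicas written must overlap.
    static boolean readSeesLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("QUORUM for RF=3: " + q);                                   // prints 2
        System.out.println("QUORUM/QUORUM overlap: " + readSeesLatestWrite(q, q, rf)); // prints true
        System.out.println("ONE/ONE overlap: " + readSeesLatestWrite(1, 1, rf));       // prints false
    }
}
```

With a replication factor of 3, QUORUM reads and writes touch 2 replicas each, so at least one replica is common to both. ONE/ONE is faster but gives up that guarantee.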

One thing Cassandra does not do is joins or ad-hoc queries. Other tools do these better (Solr, ElasticSearch, etc.). Combining tools this way is what people are calling polyglot data.

Gossip vs Snitch

Practical Cassandra helped me understand the difference between the "gossip" and "snitch" protocols. This is something that I struggled with time and time again. Gossip is the protocol that Cassandra uses to discover information about new nodes. When bringing a new node into the cluster, you must specify a "seed node". The seed nodes are a set of nodes that are used to give information about the cluster to newly joining nodes. As you can imagine, the seed nodes should be stable and should point to other seed nodes.

The snitch protocol helps map IPs to racks and data centers. It creates a topology by grouping nodes together to help determine where data is read from. There are a few types of snitches: simple, dynamic, rack inferring, EC2, and EC2 multi-region.

SimpleSnitch is recommended for a simple cluster (one data center, or one zone in a cloud architecture).
The dynamic snitch wraps the configured snitch (for example, SimpleSnitch) and provides an additional adaptive layer for determining the best possible read location.
RackInferringSnitch assumes it knows the topology of your network from the octets in each node's IP address.
Ec2Snitch is for Amazon Web Services (AWS)-based deployments where the cluster sits within a single region.
Ec2MultiRegionSnitch is for AWS deployments where the Cassandra cluster spans multiple regions.

Node Layout

Prior to Cassandra 1.2, one token was assigned to each node. A node that carried a disproportionate share of the data was considered a "hot spot". Most of the time, you would just add another node to relieve the hot spot, but then you had to rebalance the cluster. Virtual nodes, or vnodes, give a Cassandra node the ability to be responsible for many token ranges. Within a cluster, these ranges can be noncontiguous and selected at random. This provides a better distribution of data than the non-vnode paradigm.
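
The idea of one node owning many random token ranges can be illustrated with a toy token ring. This is a deliberately simplified sketch, not Cassandra's implementation: the TokenRing class, the use of random long tokens, and the figure of 256 tokens per node are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Toy model of a token ring; all names here are hypothetical.
class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Give a node tokensPerNode random tokens; 1 mimics the pre-vnode layout.
    void addNode(String node, int tokensPerNode, Random random) {
        for (int i = 0; i < tokensPerNode; i++) {
            ring.put(random.nextLong(), node);
        }
    }

    // A key belongs to the node holding the first token at or after the
    // key's token, wrapping around to the start of the ring if needed.
    String ownerOf(long keyToken) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(keyToken);
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    // How many token ranges each node is responsible for.
    Map<String, Integer> rangesPerNode() {
        Map<String, Integer> counts = new HashMap<>();
        for (String node : ring.values()) {
            counts.merge(node, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        Random random = new Random(42);
        for (String node : new String[] {"node1", "node2", "node3"}) {
            ring.addNode(node, 256, random);
        }
        System.out.println("ranges per node: " + ring.rangesPerNode());
        System.out.println("owner of token 0: " + ring.ownerOf(0L));
    }
}
```

Because each node holds many small, scattered ranges, adding a node takes a little data from every existing node instead of splitting one neighbor's range in half.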

Performance

The performance chapter was another very interesting one. As a developer, I was introduced to common *nix tools like vmstat, iostat, dstat, htop, atop, and top. All of these tools provide a picture of resource usage. The chapter also explained how instrumentation goes a long way. Also, if one node becomes too slow to respond, the FailureDetector will remove it.

An easy optimization for Cassandra is putting your CommitLog directory on a separate drive from your data directories. The CommitLog is appended to on every write, so keeping it on its own drive stops it from competing for I/O with reads and MemTable flushes. You can do this in cassandra.yaml by setting data_file_directories and commitlog_directory.

Metrics

Cassandra goes out of its way to provide lots of metrics. With all these metrics you can do capacity planning. Once you start collecting them, you'll be able to identify trends and proactively add or remove nodes. For example, you can monitor PendingTasks under the CompactionManagerMBean to learn the speed and volume at which you can ingest data; you will need to find a comfortable set of thresholds for your system. Another example is to monitor for high request latency, which can indicate that there is a bad disk or that your current read pattern is starting to slow down.

These are some of the metrics that you can get via JMX:

DB: monitors the data storage of Cassandra. You can monitor the cache and the CommitLogs, or even information about the ColumnFamily.
Internal: these cover the state and statistics around the staged architecture (gossip information and hinted handoffs).
Metrics: these are client request metrics (timeouts and "unavailable" errors).
Net: these metrics monitor the network (failure detector, gossiper, messaging service, and streaming service).
Request: these are metrics about requests from the client (read, write, and replication).

There is still a lot that I need to learn about Cassandra, especially about the data model. It's very tricky to start by thinking about your queries (pre-optimized queries, as Nate McCall calls them). All in all, the book does cover the basics.

Friday, July 11, 2014

Case for Code Reviews

I have been doing code reviews for about six months at 3CInteractive. Since I'm so new at it, it's hard for me to blog about "best practices" or even do a presentation on code reviews. Therefore, I decided to have an open space meetup at our Miami JVM Group. The open space will be around "effective code review process". I'm hoping to learn about the following: Who is using it? Who is not using it and why? Who thinks code reviews are not useful and why? What have they learned? What were some of their challenges? This post is based on my experience and some of my lessons learned while working with internal and offshore teams.

Benefits of Code Reviews

You write better code when you know it will be reviewed.
A second (or third, or fourth) set of eyes will help spot defects. This is very similar to pair programming, but it works even better if you're working with an offshore team. It's also a great way of learning new APIs. For example, someone could tell me that my code could be done more easily with the Guava library, or that the code is actually an "aggregator" Enterprise Integration Pattern, and I should probably look at Camel.
More than one person understands your code (cross-pollination, or avoiding silos). Having more than one person look at your code helps spread the knowledge and context of the problem or solution.
Reducing the learning curve for new developers. I believe that even junior developers should be part of the code review process. It's a great way for them to learn about the code base and they become more productive.

What To Look For

Bad design. Highlight issues such as SQL injection, and look out for missing design patterns or for anti-patterns. Check things like separation of concerns and encapsulation, and apply basic OOD principles: DRY, encapsulate what varies, the open-closed principle, etc.
Performance hazards. For example, memory leaks.
Lack of clarity - the application should work and the code should be readable. For example, a class named "SomethingServiceImpl" with no documentation on the class will be highlighted and will prompt a change request to the developer. Also, a big nested if statement that is not quite clear will prompt a change request.
Not consistent or not according to standards. Having a set of standards makes code reviews a lot simpler. It also sets norms for the team. For example: following Domain-Driven Design standards, consolidating APIs (Guava vs Jakarta Commons), limiting the team to a handful of languages, and having code style rules.

What Not To Look For

Premature optimization. Don't try to optimize all of it at once. As my buddy Tyler mentions, "make it work, then make it better".
Skills and expertise gaps. In our company, we allow all developers to do code reviews, including junior developers. These developers gain a lot of knowledge by doing code reviews.
Personal style. If the CTO goes over and says, "I wouldn't do it like that", it brings noise to the process.

Quality and Dissemination of Knowledge

My team is responsible for improving code quality, lowering defects in code, improving communication about code content, and teaching and mentoring junior developers. Code reviews have helped us achieve shorter development cycles, more customer satisfaction, and more maintainable code. But most important, they have helped us spread knowledge and norms. As per the book 97 Things Every Software Architect Should Know,

Chances are your biggest problem isn't technical

Most projects are built by people, and those people are the foundation for success and failure. So, it pays to think about what it takes to help make those people successful.

Tuesday, March 25, 2014

Find Which Type of Vagrant VM Is Running

I am really enjoying Vagrant. It's one of those tools that are indispensable. However, today I wanted to install a CentOS VM for my application and I didn't remember the version name that I was using in my other VMs. To find out, all you have to do is check a previous VM's Vagrantfile. Here's an example:
vim ~/vagrant_boxes/kafka/Vagrantfile
You will be able to see the version inside the file:

Friday, March 21, 2014

Strata 2014 - Newbie Perspective

Marc Andreessen noticed that software is eating the world. I see the same thing with Big Data. Big Data is shaping the world around us. It has been used in presidential elections, weather reports, consumer analysis/sentiment, fraud checks, etc. The Strata conference is the epicenter of new technologies, use cases, and innovations related to Big Data. I've been meaning to go there for quite some time. Previously, I purchased the videos from O'Reilly because I couldn't make it. Thanks to my current company, 3C (they're pretty awesome), I was able to go along with five of my coworkers. It's the place where you can meet the experts and the main committers, and ask them questions. If your eyes get dilated when you talk about Hadoop, or you get excited when you need to solve a problem that involves a huge amount of data, including the famous "three V's" (volume, velocity, and variety), then this conference is for you. This is a quick summary of my experience at the conference.

The conference revolved around four clusters:

How quickly can you get the data into your system (ingest)
How fast can you show the results
It's all about presentation (charts)
Big Data doesn't mean Hadoop

How Quickly Can You Get Data

The presentation that left me mesmerized was Spark! I can't wait to use it. It is a very compelling product and it's now backed by Cloudera. With Spark you can do the following:

Get a compute engine for Hadoop data - no need to reinvent the wheel
Speed up! An engine that is up to 100x faster than MapReduce
Sophisticated: get access to a library of sophisticated algorithms
A big community behind it; the most popular Big Data open source project (followed by Hadoop)
Learning from the big guys - Yahoo!, Conviva, and Cloudera are using it

Not to mention that it comes integrated with an analytics suite (Shark), large-scale graph processing (Bagel), and real-time analysis (Spark Streaming). This is nice because rather than learning Hive, Hadoop, Mahout, and Storm, I only have to learn one programming paradigm.

How Fast Can You Show The Results

Twitter explained how they monitor millions of time series (at 5,700+ tweets per second). The presentation was superb. I found out that the stack that they're using, named "Observability", is composed of Finagle, Cassandra, and a query language and execution engine based on Scala. Although it is a work in progress, the stack is about three years old. I hope that they open-source the stack so I can get more context on how they monitor a large distributed system.

Another very interesting product was Google's BigQuery. This was one of those presentations that we (my team and I) stumbled upon by accident. The presentation showed how to use Google's toolkit (Freebase, Maps, and BigQuery) to do analytics.

It's All About Context, Results, or Charts

Another company that impressed me was Trifacta. With their tool you can clean data, see the model (graph), and repeat the process depending on whether you see patterns or not. The tool is targeted at data scientists, data wranglers, and data analysts. It's a great tool to mine data, but most important, you can clean the data and show the results with relative ease.

IPython: This rekindled my interest in Python. IPython notebooks are great for data scientists. You can get code, text, and graphics all in one page, so it's the perfect tool to show quick results. Python was already a popular language among data scientists. The NumPy library provides a solid MATLAB-like matrix data structure, with efficient matrix and vector operations. The ecosystem also provides other great libraries like SciPy and Pandas.

Big Data != Hadoop

Two topics that opened my eyes were Mesos and YARN. Mesos, which Twitter uses to manage its clusters, is similar to YARN (Yet Another Resource Negotiator). Hadoop 2.0, with YARN, is becoming more of an environment and operating system, not just a MapReduce engine. With YARN, the JobTracker is gone; the ResourceManager (RM) takes over its job. The RM is a scheduler: it allocates resources based on a pluggable scheduling algorithm. The RM oversees all the applications, but it strictly limits itself to arbitrating the available resources among them.

One of our favorites (mine and two of my buddies') was Netflix Data Platform by Kurt Brown. It was a different and great presentation. Rather than focusing on the technology side, they explained how the culture is intertwined with their technology stack and decisions. For example, they talked about the reasons for using "the cloud". There are obvious reasons: it's cheaper, it's much more flexible (growth, a better place to do tests/spikes), and having multiple data centers is definitely a plus. Also, Amazon and RackSpace have great services such as SQL, EMR, and S3. But the main reason is "focus". They are focused on getting movies and growing their audience rather than on the "plumbing". They expressed their commitment to "open-source software" (OSS). They mentioned the great talent that they can attract and how they can "manage their own destiny" by following these principles and using these tools.

Netflix explained their philosophy and how it's the "soul" of their decisions (technical and business). For example, they keep keyboards, mice, and other peripherals in vending machines (they are free), so that everyone knows to "act in Netflix's best interest". Furthermore, every decision or project needs to answer a basic question: "what value are you adding?". They apply the rule "accept that things will break". Because of this, they build safety nets around their systems. Again, it was a very nice and interesting presentation.

I really enjoyed the conference. I also just purchased the videos, which I highly recommend! During the next few months, I'm going to try to learn some of these tools and present them at the Miami JVM Meetup. Hopefully I'll get to see you there, or better yet, at Strata 2015. If you're going to either one of these events, let's meet up, share a beer...or two, and discuss Big Data. I promise that my eyes will get dilated.

Monday, March 17, 2014

The Dilemma of Being a Good Samaritan or a Sucker

I have always liked helping people, but sometimes I stop and wonder...am I a "good Samaritan" or a "sucker"? Ever since I was little I have liked going out and talking. My mom always told me that I could talk anyone's ear off. What I've noticed is that nowadays many people tend to talk to me and ask me for things. Some people approach me to sell me something, and others to ask for help. The truth is that many people say I'm a "pleasant" guy, others say I'm "nice" (I like that), and sometimes I think I just have the face of a sucker. For example, whenever I go to a mall, the people who run the kiosks always call out to me: "Sir, can I clean your watch?" or "Young man, I have this on special." I always end up saying "no, thank you" with a smile and keep walking, only to run into another two guys or girls who are going to ask me the same thing. This is very common for me. My wife always tells me to look down so they won't chase me, but even that doesn't work! The other day, walking through a parking lot with my family, a lady stopped me in the middle of the street and called me over. Then she said, "Can you do me a favor? My shoe came untied; can you tie it for me?" And what did I do? I tied it, of course...like a good sucker. To console myself, the lady was elderly and obese. But even so, of all people, did it have to be me? My latest "good Samaritan" scene was the other day when I went out to breakfast with my family. As soon as I got out of the car, a man saw me and told me that he had locked his keys inside his car. I know, not the brightest character, but I'm super absent-minded too, so I get it. "Can you take me home to grab my spare keys? I live really close." My wife just looked at me, gave me a smile, and told me that in the meantime she would go get a table with the kids.

Honestly, many times I think, "I'm going to toughen up, act like a jerk, and tell those people to go to hell." But that's not who I am. As I said, I love to talk! When I did my latest "good Samaritan" deed, I asked the man (he was somewhat older) how many people he had asked. He answered that I was the first: "you look like a kind person." I thought, "more like the face of a sucker."

After so long like this, looking at it in retrospect, not only do I really enjoy helping, but things have also gone very well for me with my little good deeds. As I said, it feels good to help people. Besides, I think it sets a good example for my two kids (I have a twelve-year-old and a three-year-old who thinks she's thirteen). And even more so when you do it without thinking too much about it or asking for anything in return. Although I have never asked for anything, I always end up noticing nice things. Like when the man who left his keys in the car paid for my breakfast, or last year when someone bought me a $120 pair of shoes just because I had the face of a good person. Apparently, those of us with a sucker's face are very lucky.

A hug!
Marcelo

Thursday, March 6, 2014

Disruptive Possibilities: How Big Data Changes Everything

I was looking forward to this book because of the title. I was under the impression that I was going to find concrete examples of how Big Data has affected and disrupted some industries. Best of all, I thought that I was going to read about which industries will be impacted and how. The book showed some examples at the end, but in my opinion, it leaves out something very important: speed and sophistication.

I just came back from Strata 2014, which is why I was looking forward to this book, and when I heard Matei Zaharia's keynote, it was all I needed to know about the current disruption of big data. Nowadays, big data storage is becoming commoditized, so the best value added is speed (how quickly you can get the answer to your problem) and sophistication (running the best algorithms on the data). The book doesn't mention this, but it might be because of its age - things are moving super quickly in Big Data.

Some of the things that the book does well:

Introduces some history about the Big Data problem
How it affected some of the siloed technologies like RDBMS
How they solve the scalability issue

If you are a manager or someone who has no understanding of the world of Big Data, then I would recommend it. However, if you are a developer, data scientist, or data wrangler, then this book will be too basic. The one thing that I highly recommend, if you are interested in this subject, is to attend Strata (or at least purchase the videos).

You can get the book here.

Happy reading,

Marcelo