Posts Tagged ‘programming’

Intro to Scala for Java Developers – slides

Monday, August 17th, 2009

Thought I’d post the slides of a talk I gave at work on Scala. We’re primarily a Java shop, and every week we do either a code review or a tech-related presentation.

Our domain at work is analyzing residential energy data, so the examples herein are tailored to that:

  • Read or Meter Read – Some amount of energy used over a period, e.g. “100 kWh in the month of June”
  • Service Point – meta-data about an electric meter (the “point” at which “service” is available).

I also omitted a code demo where I refactored part of our codebase into Scala to show the difference (trust me, it was awesome!).

Simple Metrics for Team and Process Improvement

Monday, June 29th, 2009

Recently, the development team where I work has started collecting bona-fide metrics, based on our ticketing system. So few development shops (especially small ones) collect real information on how they work that it’s exciting that we’re doing it.

Here’s what we’re doing:

  • Number of releases during QA (we do a daily release, so more than daily is an indicator)
  • Defects found, by severity and priority
  • Average time from accepting a ticket (starting work) to resolving it (sending it for testing)
  • Number of re-opens (i.e. a defect was sent to testing, but not fixed)
  • Average time from resolving to closing (i.e. testing the fix)
  • Defects due to coding errors vs. unclear requirements (this is really great to be able to collect; with our company so new and small, we can introduce this and use it without ruffling a lot of feathers)

The tricky thing about metrics is that they are not terribly meaningful by themselves; rather they indicate areas for focussed investigation. For example, if it takes an average of 1 day to resolve a ticket, but 3 days to test and close it, we don’t just conclude that testing is inefficient; we have to investigate why. Perhaps we don’t have enough testers. Perhaps our testing environment isn’t stable enough. Perhaps there are too many show-stoppers that put the testers on the bench while developers are fixing them.

Another way to interpret these values is to watch them over time. If the number of critical defects is decreasing, it stands to reason we’re doing a good job. If the number of re-opens is increasing, we are packing too much into one iteration and possibly not doing sufficient requirements analysis. We just started collecting these on the most recent iteration, so in the coming months, it will be pretty cool to see what happens.

These metrics are pretty basic, but it’s great to be collecting them. The one thing that can make hard-core analysis of these numbers difficult (esp. over time as the team grows and new projects are created) is the lack of normalization. If we introduced twice as many critical bugs this iteration as last, are we necessarily “doing worse”? What if the requirements were more complex, or the code required was just…bigger?

Normalizing factors like cyclomatic complexity, lines of code, etc, can shed some more light on these questions. These normalizing factors aren’t always popular, but interpreted the right way, could be very informative. We’re the same team, using the same language, working on the same product. If iteration 14 adds 400 lines of code, with 3 critical bugs, but iteration 15 adds 800 lines of code with 4 critical bugs, I think we can draw some real conclusions (i.e. we’re getting better).
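
To make the arithmetic behind that example concrete, here is a minimal sketch that normalizes critical bugs by lines of code added. The class and method names are made up for illustration, and "critical bugs per 1,000 lines added" is just one possible normalization, not something our ticketing system produces.

```java
// Sketch only: DefectDensity and perThousandLines are hypothetical names.
final class DefectDensity {
    /** Critical defects per 1,000 lines of code added in an iteration. */
    static double perThousandLines(int criticalDefects, int linesAdded) {
        if (linesAdded <= 0) {
            throw new IllegalArgumentException("linesAdded must be positive");
        }
        return criticalDefects * 1000.0 / linesAdded;
    }

    public static void main(String[] args) {
        double iteration14 = perThousandLines(3, 400); // 3 bugs in 400 lines
        double iteration15 = perThousandLines(4, 800); // 4 bugs in 800 lines
        System.out.println("iteration 14: " + iteration14 + " per 1,000 lines");
        System.out.println("iteration 15: " + iteration15 + " per 1,000 lines");
    }
}
```

Iteration 14 works out to 7.5 critical bugs per 1,000 lines and iteration 15 to 5.0, so the raw count went up while the normalized rate went down.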

Another interesting bit of data would be to incorporate our weekly code review. We typically review fresh-but-not-too-fresh code, mostly for knowledge sharing and general “architectural consistency”. If we were to actively review code in development, before it is sent to testing, we could then have real data on the effectiveness of our code reviews. Are we finding lots of coding errors at testing time? Maybe more code reviews would help. Are we finding fewer critical bugs in iteration 25 than in iterations 24 and 23, where we weren’t doing reviews? If so, reviews helped a lot.

These are actually really simple things to do (especially with a small, cohesive team), and can shed real light on the development process. What else can be done?

I can haz job

Monday, December 1st, 2008

So, I am finally employed and I didn’t even have to settle. After a refreshingly protracted and detailed interview process, I’m finally schlepping myself to a job that I’m more or less excited about. That’s saying something, since I’ve spent the last 8 months at home (6.5 of them working for Gliffy) in my perfect environment: waking up whenever, using my dual-monitor mac, Rudy close by. My first day was a net win, despite having to bring in my own computer, and overall I’m not complaining because I get to use a Mac at work thank GOD.

Pluses so far:

  • Smart people I can have a conversation with
  • Meaningful product (i.e. not another CRUD app for a government agency [not that there's anything wrong with it])
  • Not only have they heard of javadoc, they use it!
  • Database migrations!
  • Clean looking code and tests that actually pass on a fresh checkout!
  • No M$ exchange server or other shitbox mail system (they use Google Apps)
  • Damn close to home; I should be biking in real soon
  • Relaxed environment
  • I’m one person away from a bona fide window with the shades open!

Honestly, it’s almost a 100% on my interview rubric (which I took down for a while, because some HR person read it and gave me shit about not liking having a dress code. I mean, does anyone really like putting on a suit and tie to sit and write code? Or to do anything? We’re talking levels of tolerance, and mine is low, mostly because I believe dress codes indicate a deeper organizational problem of priority management).

Negatives so far:

  • Kinda noisy office (fortunately few people seem to have phones)
  • Subversion (it looks like they aren’t going nuts with branches, so git-svn should preserve my sanity in this regard)

On the fence so far:

  • Maven – The only reason this isn’t a negative is that it’s better than the pile of shit ant script everyone else uses, and the build does work pretty painlessly.
  • Spring – I haven’t used Spring for anything real, and I can’t say it gets me excited (nor have I ever thought it sounded all that great), but I’m optimistic about it. I figure if it, in fact, is great, I’m happy. If it sucks, I have fodder for ranting. It’s a win/win. I do fear the XML situps tho.

Ruby and dead simple code coverage

Tuesday, October 14th, 2008

I haven’t used a code coverage tool for Java, but in my spare time I’ve been working up some Ruby code (mostly to learn the language). I’m using Test Driven Development, which is slightly simpler with Ruby than with Java (mostly due to Ruby’s interpreted nature).

I had heard about rcov and decided to give it a shot. Within literally 5 minutes I had it installed and a report showing my tests were not covering all my code! Amazing. I could then easily see exactly where I needed to test and, sure enough, found some bugs that would’ve gone unnoticed.

Even the best tool with Java would’ve required some painful ant tweaking (not to mention hopes and prayers that it worked with TestNG). I already cannot imagine writing tests without being able to view the coverage….

Using ThreadLocal and Servlet Filters to cleanly access a JPA EntityManager

Wednesday, May 14th, 2008

My current project is slowly moving from JDBC-based database interaction to JPA-based. Following good sense, I’m trying to change things as little as possible. One of those things is that we are deploying under Tomcat and not under a full-blown J2EE container. This means that EJB3 is out. After my post regarding this configuration, I quickly realized that my code started to get littered with:

EntityManager em = null;
try
{
  em = EntityManagerUtil.getEntityManager();
  // do stuff with entity manager
}
finally
{
  try {
    if (em != null) em.close();
  } catch (Throwable t) {
    logger.error("While closing an EntityManager",t);
  }
}

Pretty ugly, and seriously annoying to have to add 13 lines of code to any method that needs to interact with the database. The Hibernate docs suggest using ThreadLocal variables to provide access to the EntityManager throughout the life of a request (which wouldn’t really work for a Swing app, but since this is servlet-based, it should work fine). The ThreadLocal javadocs contain possibly the most annoying example ever, and I didn’t follow how to use it.

Anyway, I finally got around to it, and also solved the close problem, by using a Servlet Filter. I guess this type of thing would normally be solvable by Spring or Guice, but I didn’t want to drag all of that into the application to refactor this one thing; I would’ve easily spent the rest of the day dealing with XML configuration and deployment.

The solution was quite simple:

/** Provides access to the entity manager.  */
public class EntityManagerUtil
{
    public static final ThreadLocal<EntityManager>
        ENTITY_MANAGERS = new ThreadLocal<EntityManager>();

    /** Returns the EntityManager bound to the current thread by EntityManagerFilter */
    public static EntityManager getEntityManager()
    {
        return ENTITY_MANAGERS.get();
    }
}
public class EntityManagerFilter implements Filter
{
    private Logger itsLogger = Logger.getLogger(getClass().getName());
    private static EntityManagerFactory theEntityManagerFactory = null;

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException
    {
        EntityManager em = null;
        try
        {
            em = theEntityManagerFactory.createEntityManager();
            EntityManagerUtil.ENTITY_MANAGERS.set(em);
            chain.doFilter(request,response);
        }
        finally
        {
            // Always unbind, even if the chain threw, so a pooled thread
            // never carries a closed EntityManager into its next request
            EntityManagerUtil.ENTITY_MANAGERS.remove();
            try
            {
                if (em != null)
                    em.close();
            }
            catch (Throwable t) {
                itsLogger.error("While closing an EntityManager",t);
            }
        }
    }
    public void init(FilterConfig config)
    {
        destroy();
        theEntityManagerFactory =
          Persistence.createEntityManagerFactory("gliffy");
    }
    public void destroy()
    {
        if (theEntityManagerFactory != null)
            theEntityManagerFactory.close();
    }
}

So, when the web app gets deployed, the entity manager factory is created (and closed when the web app is removed). Each thread that calls EntityManagerUtil to get an EntityManager gets a fresh one that persists for the duration of the request. When the request is completed, the entity manager is closed automatically.
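
Since the ThreadLocal javadoc example didn’t help me much, here is a stripped-down sketch of the same idea that runs without a servlet container or JPA. It stores a String instead of an EntityManager, and the class and method names are invented for the demo; the set/get/remove calls play the roles of the filter and the request-handling code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: ThreadLocalDemo and handleOnNewThread are hypothetical names.
final class ThreadLocalDemo {
    // One slot per thread, like ENTITY_MANAGERS, but holding a String
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Runs a pretend request on its own thread and returns what that thread saw. */
    static String handleOnNewThread(String value) {
        AtomicReference<String> seen = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            CURRENT.set(value);              // the filter's job before the chain runs
            try {
                seen.set(CURRENT.get());     // what request-handling code does
            } finally {
                CURRENT.remove();            // the filter's job after the chain returns
            }
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        CURRENT.set("main thread value");
        System.out.println(handleOnNewThread("request-1")); // the worker's own value
        System.out.println(CURRENT.get());                  // main's value is untouched
    }
}
```

Each thread only ever sees what it stored itself, which is why the request-handling code can call a static method and still get the right EntityManager.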

Time Machine almost saved me, but git won out in the end

Friday, May 9th, 2008

So, I’m working on a project that’s using Subversion for version control. My network connection isn’t great, plus Subversion is slow, plus git is (so far) pretty awesomely awesome. The way to interact with an SVN repository is via git-svn, which I talked about setting up previously. Everything’s been going great; however, I don’t frequently commit to Subversion. This week, we started setting up continuous integration for my work, so I did a git-svn dcommit, committing two days’ worth of changes. I had forgotten that I had made so many changes (including adding Hibernate support). I misread the commit messages and thought something bad was happening. Control-C. git log. HEAD is recent. Last commit was….yesterday. Oh. Fuck.

I figure git-svn borked something, so I git-reset --hard. No effect. I’m starting to panic now. Almost 2 days of work lost is not something I’m looking forward to. I hastily go into Time Machine and get the previous hour’s backup. But, I just hate that solution. I have no idea what happened, and my trust in Git (or my ability to use it) has to be restored. After IM’ing with a co-worker, I got to the bottom of it.

It turns out that I wasn’t paying attention to how git-svn works. What it does when you do a rebase or dcommit (which implicitly does a rebase), is to first undo all your changes since your last rebase/dcommit, and get the changes made to the SVN repository (it even says as much as the first line of the output). It then “replays” your commits to make sure there’s no conflicts.

By hitting Control-C in the middle of that, I manually caused the same situation that would happen if there were conflicts. Git stops, tells you to resolve conflicts, and asks you to git-rebase --continue. If I had just git-rebase --continue‘ed, I would be fine. Since I did a hard reset, I figured I was fucked. Enter the log.

.git/logs/HEAD contained information about all activity, including my missing commits. I grab the version numbers (which, in Git, are hashes of the entire repository), do a git-reset --hard big.honkin.git.hash.version and voilà! Everything’s back to how it was (the command ran instantaneously, to boot).
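
For the curious, each line of .git/logs/HEAD holds the old hash, the new hash, who made the change, a timestamp, and then a tab followed by a message. Here is a rough sketch of pulling the useful bits out of one line. The class name, the sample hashes, and the sample author are all made up; only the general line layout is taken from git.

```java
// Sketch only: ReflogEntry is a hypothetical helper, not part of git.
final class ReflogEntry {
    final String oldHash;
    final String newHash;
    final String message;

    ReflogEntry(String oldHash, String newHash, String message) {
        this.oldHash = oldHash;
        this.newHash = newHash;
        this.message = message;
    }

    /** Parses one reflog line: "<old> <new> <who> <when> <tz>", a tab, then the message. */
    static ReflogEntry parse(String line) {
        int tab = line.indexOf('\t');
        String header = tab >= 0 ? line.substring(0, tab) : line;
        String message = tab >= 0 ? line.substring(tab + 1) : "";
        String[] fields = header.split(" ");
        if (fields.length < 2) {
            throw new IllegalArgumentException("not a reflog line: " + line);
        }
        return new ReflogEntry(fields[0], fields[1], message);
    }

    public static void main(String[] args) {
        // Made-up sample line; real hashes are 40 hex characters like these
        String line = "0123456789abcdef0123456789abcdef01234567 "
                + "89abcdef0123456789abcdef0123456789abcdef "
                + "A Developer <dev@example.com> 1210300000 -0400"
                + "\tcommit: Add hibernate support";
        ReflogEntry entry = parse(line);
        // The new hash is the one to hand to git reset --hard
        System.out.println(entry.newHash + "  " + entry.message);
    }
}
```

In practice reading the file by eye is enough; the point is just that the second hash on the line for the lost commit is the one to reset to.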

Using Java Persistence with Tomcat and no EJBs

Thursday, May 8th, 2008

The project I’m working on is deployed under Tomcat and isn’t using EJBs. The codebase is using JDBC for database access and I’m looking into using some O/R mapping. Hibernate is great, but Java Persistence is more desirable, as it’s more of a standard. Getting it to work with EJB3 is dead simple. Getting it to work without EJB was a bit more problematic.

The entire application is being deployed as a WAR file. As such, the JPA configuration artifacts weren’t getting picked up. Setting aside how absolutely horrendous Java Enterprise configuration is, here’s what ended up working for me:

  • Create a persistence.xml file as per standard documentation leaving out the jta-data-source stanza (I could not figure out how to get Hibernate/JPA to find my configured data source)
  • Create your hibernate.cfg.xml, being sure to include JDBC connection info. This will result in Hibernate managing connections for you, which is fine
  • Create a persistence jar containing:
    • Hibernate config at root
    • persistence.xml in META-INF
    • All classes with JPA annotations in root (obviously in their java package/directory structure)
  • This goes into WEB-INF/lib of the war file (being careful to omit the JPA-annotated classes from WEB-INF/classes)

The first two steps took a while to get to and aren’t super clear from the documentation.

To use JPA, this (non-production quality) code works:

EntityManagerFactory emf =
    Persistence.createEntityManagerFactory("name used in persistence.xml");
EntityManager em = emf.createEntityManager(); 

Query query = em.createQuery("from Account where name = :name");
query.setParameter("name",itsAccountName);
List results = query.getResultList();

// do stuff with your results

em.close();
emf.close();

The EntityManagerFactory is supposed to live for the life of the application and not be created/destroyed on every request.

I also believe there might be some transaction issues with this, but I can’t figure out from the documentation what they are and if they are a big deal for a single-database application.

Update: Turns out, it’s not quite this simple. Since this configuration is running outside an EJB container, and given Bug #2382, you can query all day long, but you cannot persist. To solve this, you must work in a transaction, like so:

EntityManagerFactory emf =
    Persistence.createEntityManagerFactory("name used in persistence.xml");
EntityManager em = emf.createEntityManager();
EntityTransaction tx = em.getTransaction();

tx.begin();
Query query = em.createQuery("from Account where name = :name");
query.setParameter("name",itsAccountName);
List results = query.getResultList();

// modify your results somehow via persist()
// or merge()

tx.commit();
em.close();
emf.close();

Again, this is not production code as no error handling has been done at all, but you get the point.
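
As a rough idea of what the missing error handling might look like, here is a sketch of the begin/commit/rollback shape. To keep it runnable without a JPA provider, it uses a hypothetical Tx interface that mirrors the begin(), commit(), rollback() and isActive() methods of EntityTransaction; the RecordingTx class exists only so the demo can show what happened.

```java
// Sketch only: Tx, RecordingTx and inTransaction are hypothetical stand-ins.
final class TransactionSketch {
    /** Mirrors the four EntityTransaction methods this pattern relies on. */
    interface Tx {
        void begin();
        void commit();
        void rollback();
        boolean isActive();
    }

    /** Runs the work in a transaction; rolls back if the work or the commit throws. */
    static void inTransaction(Tx tx, Runnable work) {
        tx.begin();
        try {
            work.run();
            tx.commit();
        } finally {
            // Still active here means commit never completed, so undo everything
            if (tx.isActive()) {
                tx.rollback();
            }
        }
    }

    /** In-memory Tx that records the calls made on it, for demonstration only. */
    static final class RecordingTx implements Tx {
        final StringBuilder log = new StringBuilder();
        private boolean active;
        public void begin() { active = true; log.append("begin;"); }
        public void commit() { active = false; log.append("commit;"); }
        public void rollback() { active = false; log.append("rollback;"); }
        public boolean isActive() { return active; }
    }

    public static void main(String[] args) {
        RecordingTx tx = new RecordingTx();
        inTransaction(tx, () -> System.out.println("persist or merge would go here"));
        System.out.println(tx.log);
    }
}
```

With real JPA the same shape applies to em.getTransaction(), with em.close() in an outer finally block.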

Git and SVN: connecting git branches to svn branches

Monday, April 28th, 2008

Currently working on a project where Subversion is the CM system of choice. I’d like to use git, as it’s faster and doesn’t require so much network access. Plus, I’m hoping when it comes time to merge, I can simplify the entire process by using git’s allegedly superior merging technique. At any rate, I’ve got a branch on SVN to work on, and I want to track both that branch and the entire svn tree.

Saturday morning, I did a git-svn init from their repository. Today, after lunch, it finished. After doing a git-gc to clean up the checkout, it wasn’t clear how to connect branches. Following is what I did (assume my subversion branch is branches/FOO):

git-checkout -b local-trunk trunk
git branch local-foo FOO

The first thing creates a new branch called “local-trunk” started at “trunk” (which is the remote branch mapping to the subversion main trunk). The second command creates a new branch called “local-foo”, which is rooted at remote branch “FOO”. I have no clue why I couldn’t do the same thing twice, as both commands seem to do the same thing (the first switches to the branch “local-trunk” after creating it). But, this is what worked for me.

Now, to develop, I git checkout local-foo and commit all day long. A git-svn dcommit will send my changes to subversion on the FOO branch. I can update the trunk via git checkout local-trunk and git-svn rebase. My hope is that I can merge from the trunk to my branch periodically and then, when my code is merged to the trunk, things will be pretty much done and ready to go. We’ll see.

On a side note, the git repository, which contains every revision of every file in the subversion repository is 586,696 bytes. The subversion checkout of just the FOO branch is 1,242,636 bytes; over double the size, and there’s still not enough info in that checkout to do a log or diff between versions.

REST Security: Signing requests with secret key, but does it work?

Monday, April 21st, 2008

Both Amazon Web Services and the Flickr Services provide REST APIs to their services. I’m currently working on developing such a service, and noticed that both use signatures based on a shared secret to provide security (basically using a Hash Message Authentication Code).

It works as follows:

  1. Applications receive a shared secret known only to them and the service provider.
  2. A request is constructed (either a URL or a query string)
  3. A digest/hash is created using the shared secret, based on the request (for Flickr, the parameter keys and values are assembled in a certain way, so that Flickr can easily generate the same string)
  4. The digest is included in the request
  5. The service provider, using the shared secret, creates a digest/hash on the request it receives
  6. If the service provider’s signature matches the one included in the request, the request is serviced
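
The steps above can be sketched with the HMAC support that ships with Java. This is not Flickr’s or Amazon’s actual scheme: the sorted key=value canonical form, the choice of HmacSHA256, and the class and method names are all assumptions made for illustration, and a real implementation would also need to escape parameter values.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch only: RequestSigner and its canonical form are hypothetical.
final class RequestSigner {
    /** Joins parameters in sorted key order so both sides build the same string. */
    static String canonicalize(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(params).entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Steps 2-3: hex-encoded HMAC of the canonical request, keyed with the shared secret. */
    static String sign(String secret, Map<String, String> params) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(canonicalize(params).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    /** Steps 5-6: the provider recomputes the signature and compares the two. */
    static boolean verify(String secret, Map<String, String> params, String signature) {
        byte[] expected = sign(secret, params).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, signature.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<String, String> params = new TreeMap<>();
        params.put("image_id", "45");
        params.put("type", "jpg");
        String signature = sign("not-a-real-secret", params);
        System.out.println(canonicalize(params) + "&signature=" + signature);
        System.out.println(verify("not-a-real-secret", params, signature));
    }
}
```

Sorting the keys is what lets the client and the service provider arrive at the same string no matter what order the parameters appear in the URL.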

It’s actually quite simple, and for one-time requests, is effective. The problem, however, is that anyone intercepting the request can make it themselves, unless some other state is shared between the client and service provider. Consider a request for an image. The unsigned request might look like:

http://www.naildrivin5.com/api/images?image_id=45&type=jpg

The signed request would look like so:

http://www.naildrivin5.com/api/images?image_id=45&type=jpg&signature=34729347298473

So, anyone can then take that URL and request the resource. They don’t need to know the shared secret, or the signature algorithm. This is a bit of a problem. One of the advantages of REST is that URLs that request resources are static and can be cached (much as WWW resources are). So, if I wish to protect the given URL, how can I do so?

HTTP Authentication

The usual answer is HTTP Authentication; the service provider protects the resource, and the client must first log in. Login can be done programmatically, and this basically accomplishes sending a second shared secret with the request that cannot be easily intercepted. HTTP Auth has its issues, however, and might not be feasible in every context.

Another way to address this is to provide an additional piece of data that makes each request unique and usable only once. To do so requires state to be saved on the client and the server.

Negotiated One-time Token

Authentication can be avoided by using the shared secret to establish a token, usable for one request of the given resource. It would work like this:

  1. Client requests a token for a given resource
  2. Service Provider creates a token (via some uuid algorithm ensuring no repeats) and associates it with the resource
  3. Client creates a second request, as above, for the resource, including the token in the request
  4. Service Provider checks not just for a valid signature, but also that the provided token is associated with the given resource
  5. If so, the token is retired, and the resource data is returned

Here, the URL constructed in step 3 can be used only once. Anyone intercepting the request can’t make it again, without constructing a new one, which they would be unable to do without the shared secret. Further, this doesn’t preclude caching. The main issue here is that since two requests are required, simultaneous access to one resource could result in false errors: if Client A acquires a token, and Client B requests one before Client A uses the token, Client A’s token could be squashed, resulting in an error when he makes his request. The service provider can alleviate this by allowing the issuance of multiple active tokens per resource.
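
Here is a minimal sketch of the service provider’s bookkeeping for this scheme, allowing several active tokens per resource. The class and method names are hypothetical, random UUIDs stand in for “some uuid algorithm”, and the signature check from the earlier steps is left out to keep the focus on the token.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: OneTimeTokens is a hypothetical in-memory token registry.
final class OneTimeTokens {
    // token -> resource it was issued for; several tokens may be live per resource
    private final Map<String, String> issued = new ConcurrentHashMap<>();

    /** Steps 1-2: create a fresh token and associate it with one resource. */
    String issue(String resource) {
        String token = UUID.randomUUID().toString();
        issued.put(token, resource);
        return token;
    }

    /** Steps 4-5: accept a token once, and only for the resource it was issued for. */
    boolean redeem(String token, String resource) {
        // remove(key, value) succeeds only if the token is still mapped to this
        // resource, so checking and retiring the token happen in one atomic step
        return issued.remove(token, resource);
    }

    public static void main(String[] args) {
        OneTimeTokens tokens = new OneTimeTokens();
        String token = tokens.issue("images/45");
        System.out.println(tokens.redeem(token, "images/45")); // first use
        System.out.println(tokens.redeem(token, "images/45")); // replayed URL
    }
}
```

Because each client gets its own token, Client B asking for a token no longer squashes Client A’s.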

Timestamp

A disadvantage to the One-Time Token method is that it requires two requests of the service provider for every actual request (one to get the token and one to request the resource). A way around that is to include a timestamp in the request. This would work as follows:

  1. Client creates request, including the current time. This request is signed as per above procedure
  2. Service provider validates the request and compares its time with the given timestamp.
  3. If the difference in the service provider’s time and the client’s provided time is within some tolerance, the request is serviced

This obviously requires the two clocks to be vaguely in sync. It also allows the resource to be requested by anyone within the timespan of the tolerance. But it does save the client a second request.
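
The tolerance check in step 3 is tiny; a sketch, assuming both times are expressed as milliseconds since the epoch and using names invented for the example:

```java
// Sketch only: TimestampCheck is a hypothetical helper.
final class TimestampCheck {
    /** True if the client's timestamp is within the tolerance of the server's clock, in either direction. */
    static boolean withinTolerance(long clientMillis, long serverMillis, long toleranceMillis) {
        return Math.abs(serverMillis - clientMillis) <= toleranceMillis;
    }

    public static void main(String[] args) {
        long serverNow = System.currentTimeMillis();
        long fiveMinutes = 5 * 60 * 1000L;
        System.out.println(withinTolerance(serverNow - 1000L, serverNow, fiveMinutes));
        System.out.println(withinTolerance(serverNow - 2 * fiveMinutes, serverNow, fiveMinutes));
    }
}
```

Checking the absolute difference also accepts a client clock that runs slightly ahead of the server’s.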

Self-created One-time Token

This is an amalgam of the Timestamp solution and the Negotiated One-time Token solution. Here, the client creates its own token, as a simple integer of increasing value. The server maintains the last requested value and accepts only requests with a higher number:

  1. Client creates request, using a global long-lived number
  2. Client signs the request and sends it to the service provider
  3. Service provider validates the signature and compares the provided numeric token with the one last used (the tokens can be globally scoped, or scoped for a given resource)
  4. If the provided numeric token is greater than the previous, the request is serviced
  5. The Client increments his numeric token for next time

As with the Timestamp solution, only one request is required. As with the negotiated one-time token solution, the URL can never be used twice. The main issue here is if the client forgets its numeric token. This could be addressed with an additional call to re-establish the token, made only when the Client has determined it no longer knows the last used value.

Unfortunately, this is much more susceptible to race conditions than the Negotiated one-time token. Since the service provider doesn’t know what tokens to expect (only that they should be greater than the last requested one), the client has to ensure that the “create request, submit request, receive response, update local numeric token” cycle is atomic. That is not straightforward.
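
The service provider’s half of this, at least, can be made safe against concurrent requests. A sketch, with a hypothetical class name and a single globally scoped counter:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: CounterTokenServer is a hypothetical name; tokens are globally scoped here.
final class CounterTokenServer {
    private final AtomicLong lastAccepted = new AtomicLong(0);

    /** Accepts the token only if it is strictly greater than every token accepted so far. */
    boolean accept(long token) {
        while (true) {
            long last = lastAccepted.get();
            if (token <= last) {
                return false; // replayed or stale token
            }
            // Only record the token if nobody else advanced the counter meanwhile
            if (lastAccepted.compareAndSet(last, token)) {
                return true;
            }
        }
    }

    public static void main(String[] args) {
        CounterTokenServer server = new CounterTokenServer();
        System.out.println(server.accept(1)); // first request
        System.out.println(server.accept(1)); // same URL replayed
        System.out.println(server.accept(2)); // next request
    }
}
```

This does nothing for the client-side problem: the client still has to make its own create, submit, receive, update cycle atomic.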

Update: Got another idea from a co-worker

Session Token

When a user accesses the system that uses the REST API, they get issued a token (via the REST API). This token is just like a session token, with an inactivity timeout and so forth. The token can be manually invalidated via the API, so that when a user logs out or completes some logical task, the token can be invalidated.

This suffers none of the problems of the other solutions, though it isn’t the most secure. However, the security problem it has (using the valid URL before the session times out) is fairly minor, and the tradeoff of getting one request per actual request and no race conditions makes it probably the best way to go.
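
A sketch of what the token bookkeeping could look like, with an inactivity timeout and manual invalidation. Everything here is hypothetical: the class name, the in-memory map, and the injected clock (which is only there so the timeout can be exercised without waiting).

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Sketch only: SessionTokens is a hypothetical in-memory session registry.
final class SessionTokens {
    private final Map<String, Long> lastUsed = new ConcurrentHashMap<>();
    private final long timeoutMillis;
    private final LongSupplier clock;

    SessionTokens(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
    }

    /** Issues a token when the user starts using the system. */
    String issue() {
        String token = UUID.randomUUID().toString();
        lastUsed.put(token, clock.getAsLong());
        return token;
    }

    /** True if the token is known and not idle past the timeout; a valid use resets the idle timer. */
    boolean touch(String token) {
        long now = clock.getAsLong();
        // computeIfPresent runs atomically per key: refresh a live token, drop an expired one
        Long updated = lastUsed.computeIfPresent(token, (t, previous) -> {
            if (now - previous > timeoutMillis) {
                return null; // returning null removes the expired token
            }
            return now;
        });
        return updated != null;
    }

    /** Manual invalidation, e.g. on logout. */
    void invalidate(String token) {
        lastUsed.remove(token);
    }

    public static void main(String[] args) {
        SessionTokens sessions = new SessionTokens(30 * 60 * 1000L, System::currentTimeMillis);
        String token = sessions.issue();
        System.out.println(sessions.touch(token)); // still within the timeout
        sessions.invalidate(token);
        System.out.println(sessions.touch(token)); // rejected after logout
    }
}
```

Every request would call touch with the token it carries, in addition to the signature check.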

Distributed version control with Git for code quality and team organization

Tuesday, April 15th, 2008

In my previous post, I outlined a code review process I’ve been using with reasonable effectiveness. It’s supported, in my case, by the Git source code management tool (best known for its use in managing the Linux kernel). Git or, more generally, distributed development, can encourage some good quality control procedures in teams working on enterprise software. The lessons learned from the open source world (and the Linux kernel, in particular) can be applied outside the world of OSS and to the consultant-heavy world of enterprise/in-house software development.

The project I’ve been working on for the past several months has undergone what I believe to be a common change on in-house/enterprise software, which is that several new developers are being added to the project. Outside of the learning curve required with any new system, many of them are not seasoned Java developers, or are otherwise missing experience in some key technologies in use. While code reviews are a great way to ensure these developers are doing things the right way, there is still concern that their ability to commit to source control could be problematic for the entire team.

Consider a developer breaking the build, or incorrectly refactoring a key piece of shared code. A review of their commit and some continuous integration can help identify these problems, but, once identified, they must be removed from the codebase. In the meantime, the development team could be stuck with an unusable build. This can lead to two bad practices:

  • Commit very rarely
  • Get new changes from the repository only when absolutely needed

These “anti-practices” result in unreadable commit logs, difficult (or skipped) code reviews, duplication of code, and a general incoherence of the system. This is primarily due to the way most common version control systems work.

In reserved-checkout systems (e.g. PVCS, StarTeam) and concurrent systems (CVS, Subversion), there is the concept of the one true repository of code that is a bottleneck for all code on the project. The only way Aaron can use Bill’s code is for Bill to commit it to the repository and for Aaron to check it out (along with anything else committed since the last time he did so). The only way Carl can effectively review Dan’s code, or for the automated build to run his test cases, is to checkout code from the repository and examine/run it. This reality often leads to situations where each developer is operating on his own branch. The problem here is that CVS and Subversion suck at merging. This makes the branching solution effectively useless.

Enter Git. With Git, there is no central repository. Each developer is on his own branch (or his own copy of someone’s branch) and can commit to their heart’s content, whenever they feel they have reached a commit point. Their changes will never be forced upon the rest of the team. So, how does the code get integrated?

Developers submit their code to the team lead/integrator (who is the ultimate authority on what code goes to QA/production/the customer), who then reviews it and either accepts or rejects it. If code is rejected, the team lead works with the developer to get it accepted (either via a simple email of the issues, or more in-depth mentoring as needed). Git makes this painless and fast, because it handles merging so well.

Consider how effective this is, especially when managing a large (greater than, say, five) team of developers working concurrently. The only code that gets into the production build will have been vetted through the team lead; he is responsible for physically applying each developer’s patches (an action that takes a few minutes or even seconds in Git). Further, developers get instant feedback on their code quality. In most cases, bad commits are the result of ignorance and lack of experience. A code review, with instant feedback, is a great way to address both of those issues, resulting in a better developer and a better team, based on open, honest, and immediate communication.

Here’s how to set this up:

  1. Assign a team lead to integrate the code – this is a senior developer who can assess code quality, provide mentoring and guidance and can be trusted to put code into the repository destined for QA and production
  2. Each developer clones the team lead’s repository – This is done to baseline the start of their work
  3. Developers commit, branch, merge, and pull as necessary – Since Git makes merging simple, developers can have full use of all features of version control and can do so in their environment without the possibility of polluting the main line of development. They can also share code amongst themselves, as well as get updates from the team lead’s repository of “blessed” code.¹
  4. Developers inform the lead of completion
  5. Lead pulls from their repository – The lead reviews the developer’s changes and applies the patch to his repository. He can then exercise whatever quality control mechanisms he wishes, including automated tests, manual tests, reviews, etc.²
  6. Lead rejects patches he doesn’t agree with – If the patch is wrong, buggy, or just not appropriate in some way, the lead rejects the patch and provides the developer with information on the correct approach
  7. Lead accepts patches he does agree with – If the lead agrees with the patch, he applies it to his repository, where it is now cleared for QA

This may seem convoluted, but it actually carries little overhead compared to a junior developer performing a “nuclear bomb” commit that must then be rolled back. For much larger teams, the approach can be layered, with the primary team lead accepting patches only from lieutenants, who accept patches from the primary developers.

Unlike a lot of hand-wavy processes and practices, this model has been demonstrated effective on virtually every open source project. Even though the Linux kernel is one of the few to use technology to support this process (Git), every other large OSS project has the concept of “committers” who are the people allowed to actually commit. Anyone else wishing to contribute must submit patches to a committer, who then reviews and approves of their patch (or not).

I believe this would be highly effective in a professional environment developing in-house or enterprise software (especially given the typical love of process in those environments; this process might actually help!). I have been on at least three such projects where it would’ve been an enormous boon to quality (not to mention that the natural mentoring and feedback built into the process would’ve been hugely helpful for the more junior developers).


¹ Git even allows a developer to merge certain commits from one branch to another. Suppose Frank is working on a large feature, and happens to notice a bug in common code. He can address that bug and commit it. Gary can then merge only that commit into his codebase to get the bugfix, without having to also take all of Frank’s in-progress work on the large feature. Good luck doing that with StarTeam.
² A CI system could be set up in a variety of ways: it could run only against the lead’s “blessed” repository, or it could run against an intermediate repository created by the lead (who then blesses patches that pass), or it could be totally on its own and allow developers to submit against it prior to submitting to the lead.