Anatomy of a Scientific Research Paper (Part 2): The Dreaded Introduction

The best place to start writing is the introduction.  Many people dread writing the introduction, though: it is the first place that reviewers will decide to reject your paper, so you have to get it right.  It takes three things.  One is a strong understanding of your research project: what problem you are solving and why.  Your advisor should help you with this, and I will write another post on how to best understand your own research.  Another “thing” is a strong understanding of your audience, so you know what you need to explain and what you can skip.  This understanding comes in time after reading many papers in your field.

The last “thing” it takes to write a good introduction is to organize your thoughts in the way a reader is expecting.  This organization is rather mysterious in the sciences, so I am going to explain it now.

Your reader is expecting the following:

  1. A description of the high-level domain / environment
  2. A key problem that happens in this high-level domain
  3. Current solutions to that key problem
  4. A problem with the current solutions
  5. A glimpse at how you solve the problem in (4)
  6. An overview of your evaluation / experiment results

That’s it!  If you write a few sentences on each of these, you will have your 1-page introduction in an acceptable format.  Now, the “problem” may be a real problem practitioners have, an unsolved technical problem, or even a “knowledge problem” (such as an issue that has not been studied and explained in the literature yet).  But the principle is almost always the same.  Let’s look at each item.

High-level domain / environment

This is where you connect your research to a real-world problem.  Do not get flowery here; this is not Shakespeare and it is not a speech.  Start with a definition of your problem.  If you are describing a new technique for growing drought-resistant corn, a good starting sentence might be “Drought-resistant corn is…”  If you are describing a study of lightning bolts and tree mortality, a good starting sentence might be “Tree mortality from lightning strikes is…”  Etc.  Answer questions which are obvious to you, but non-obvious to others, such as whether tree mortality has to take place immediately after the strike, or within 2 days, or what.

Take a look at this paper I co-authored this year:

Panichella, A., McMillan, C., Moritz, E., Palmieri, D., Oliveto, R., Poshyvanyk, D., and De Lucia, A., “Using Structural Information and User Feedback to Improve IR-based Traceability Recovery”, in Proceedings of the 17th European Conference on Software Maintenance and Reengineering (CSMR’13), Genova, Italy, March 5-8, 2013, pp. 199-208.

From paragraph 1:

Traceability recovery is a key software maintenance activity in which software engineers extract the relationships among software artifacts.  These relationships (called “traceability links”) are a valuable resource during software maintenance because they provide a connection from high-level software documents such as use cases to low-level implementation details, such as source code and test cases [3].

“Traceability recovery is…”  The reader immediately knows what this paper is about.  Even if you don’t know what traceability recovery is, you now know what you need to know in order to understand the paper.

Problem in high-level domain

So you’ve explained the high-level domain.  Great.  So what’s the problem in that domain?  Spend some space explaining it in simple language.  Perhaps the problem in creating drought-resistant corn is that water evaporates quickly from the leaves.  So you would say “Unfortunately, drought-resistant corn is difficult to breed because water evaporates quickly from the leaves.”  Here’s the example from my paper:

Unfortunately, traceability links are notoriously difficult to extract from software [3, 13, 28].  Software engineers must read and understand different artifacts to determine whether a link exists between two artifacts.  Meanwhile, the artifacts are constantly being modified in the midst of an evolving software system.  Maintaining a list of up-to-date traceability links inevitably becomes an overwhelming, error-prone task.  Automated tools for traceability recovery offer an opportunity to reduce this manual effort and increase productivity.

Traceability links are hard to extract.  Don’t believe us?  Check out these related papers.  It is extremely expensive to do by hand.

Current solutions

The problem is well-known, so there must be current techniques to address it.  You may think of these current solutions as your competition, or as the baseline against which you evaluate your new technique, even if the only current solution is a manual one.  In the case of a literature “knowledge problem” paper, the current solutions may be papers that partially answer the question, or address similar questions.  In the corn example, maybe the current strategy is to breed plants with thicker leaves that dry out less quickly.  Or, in our paper about traceability links:

Information Retrieval (IR) [5] has gained wide-spread acceptance as a method for automating traceability recovery [3, 13, 21, 28].  The IR-based methods, such as those based on Vector Space Model (VSM) [5] or probabilistic Jensen and Shannon (JS) model [1], identify traceability links using the textual information from the software artifacts.  For example, the keywords from documents describing use cases may match keywords in the comments of a source code file.  Textual information has the advantage of being widely available…

Problems with current solutions

The current solutions ain’t all ice cream and lollipops.  They have their own problems.  Sometimes severe problems.  Talk about them.  Maybe breeding corn plants with thick leaves causes the corn ear to be smaller, because energy is used by making leaves instead of grain.  Say so.  From our traceability paper:

…but it is unfortunately also highly subjective.  Words may have multiple meanings, identifiers from software are often misleading if taken out of the context, and comments are frequently out of date [2].  Different strategies have been successful in improving IR-based methods, including text pre-processing (e.g., [37, 39]), smoothing filters [12], and combinations of these approaches [17].  Nevertheless, imprecision remains a major barrier to using IR for traceability link recovery in practice.

Notice the pattern of problem, solution, problem-with-solution, solution-to-problem-with-solution, etc.  This can continue a couple of times until you get to your point:

Structural information contained in source code (e.g., function calls or inheritance relationships) has been proposed in solutions to increase the precision of IR-based traceability recovery [31].  In general, a combined approach will use an IR-based method to locate a set of candidate links, and then either augment or filter the set of links based on the structural information.  However, combined approaches tend to be sensitive to the IR method. If the candidate links are correct, then the structural information can help locate additional correct links. Otherwise, the structural information offers little help, or will even pollute the results with incorrect links.

Your solution to the problem-in-the-current-solution

Finally you describe what you did.  Take a paragraph to tell your reader the key idea behind your approach.  Maybe you bred corn plants with deeper roots instead of thicker leaves.  Use simple language to tell the world.  Here’s our key idea:

Our conjecture is that the traceability links recovered by IR-methods should be verified by software engineers prior to expanding the set of links with structural information.

Then we went on to give a brief example to illustrate our answer.  An example is one way to support your key idea; citing supporting literature is another.

Summary of experimental results

Now explain how you evaluated what you did and what results you saw.  Remember you only have a few sentences here.  Maybe you planted both your variety of corn and a competitor’s corn in quarter-acre test plots in 5 different counties.  Say so briefly.  Then say what your results were, such as finding corn ears that were 5% larger, or whatever.

Follow my simple formula here and you will find your introductions much easier to write, and much easier for reviewers to accept!  Of course there are many factors to a good paper, but my formula will get you started.
