Big Code Science

I am pitching "Big Code Science" (my take on the mashup of mining software repositories, source-code analysis, program comprehension, etc.) to an inter-faculty meeting at my university. (I am about to start an extended unpaid leave of absence to join Facebook.) I will have just 15 minutes in a brown-bag setting; thus, I am going to use images, charts, and simple messages.

Title: Big Code Science

Abstract: Code Science is Data Science for code. Big Code Science is the scientific approach to accessing, analyzing, and understanding big data where the data here is code or data related to software development. There are several reasons why Big Code Science has taken off. (i) Open Source development has exploded in the last 10 years so that we have access to terabytes of source code, version history, developer communication, documentation, release information, bug tracking information, etc.; not trying to learn from the past would be crazy. (ii) Big IT and other corporations (Facebook, Google, IBM, Microsoft, Philips, Siemens, ...) critically depend on their super-huge code bases for their businesses to function and to develop further, which is an extraordinary challenge because robustness, performance, security, maintainability, evolvability, and other critical parameters are increasingly hard to control when code bases grow; size does matter and science must come to the rescue. (iii) Machine learning, information retrieval, data mining, parallel programming, text analysis, traceability recovery, program analysis, reverse engineering, and yet other relevant techniques have matured, also in the context of industrial-scale software engineering, so that we are definitely able to deal with big code both technically and methodologically. In this talk, I am going to look at a few topics that my research team has addressed in the context of Big Code Science over the last few years. I also hint at some challenges ahead -- some of which I also hope to look into during my appointment at Facebook.


Acknowledgment: This is a team effort; I am grateful to these former and current students and team members:

  • Hakan Aksu (current PhD student)
  • Johannes Härtel (current PhD student) 
  • Marcel Heinz (current PhD student)
  • Rufus Linke (former diploma student)
  • Ekaterina Pek (former PhD student)
  • Jürgen Starek (former diploma student)
  • Andrei Varanovich (former PhD student)


Hardware lovers --- it's Christmas time!

I have enjoyed this collection long enough.

It's time to pass it on to a broader audience or more committed individuals.

Constraints for passing on stuff:
  • I like Saint-Émilion (Grand Cru specifically).
  • The stuff is located in Koblenz; I live in Bonn.
  • Pick up preferred; I can deliver to "nearby" institutional collectors.
  • Let's take photos of the hardware, collector, and me -- and post it on Instagram. 
There is this stuff:
I should also note that I have endless amounts of other legacy hardware such as phones, modems, printers, cables, and what have you. So you are encouraged to visit me in Koblenz and take stuff and leave some Saint-Émilion Grand Cru behind.


Thoughts on a very semantic wiki


101wiki started as a boring MediaWiki installation to document software systems in the chrestomathy ‘101’. Semantic wiki extensions were quickly adopted; eventually our team developed a full-blown proprietary semantic wiki sort of from scratch. Now we have also rehosted it and given it a new look. (BTW, the 101companies brand name is now all gone. It's now just '101' really.)


The biggest mistake we (I!) made in said project ‘101’ is that we had only very loose specs for system implementation and system documentation; we had no proper process for checking and accepting contributions either. Thus, the 101wiki content was always a big mess and it still is. This problem is so serious that we switched to discouraging contributions a few years ago; instead, we deal with what we have and add content only when absolutely necessary. However, we depend on the 101wiki content for teaching; we also use it as a linked data hub for software language engineering-related research projects such as MetaLib, MegaLib, and YAS.

With a small group of people, we are now starting a significant content and ontology-modeling push, which hopefully will lead to some islands of sanity on 101wiki. In what follows, I am going to describe the rationale for what’s emerging.

Feedback more than welcome.

Semantic wiki basics

  • Typed links: Property names are used to qualify (to ‘type’) links. For instance, we use ‘sameAs’ to express that a 101wiki entity (page) is the same as some entity (page) elsewhere. Also, we use ‘uses’ to express that a contribution (a system implementation) uses some language or technology. We tend to relate 101wiki entities (pages) to Wikipedia resources. See here for a list of 101wiki’s properties.
  • Typed pages: We organize pages in ‘namespaces’ such as 'Language', 'Technology', or 'Contribution'. We use namespace names as prefixes/qualifiers of page names. For instance, we say ‘Language:Java’ rather than ‘Java_(Programming language)’ on Wikipedia. The fact that Java is a programming language is taken care of by a semantic property. That is, Java is declared to be an instance of 'OO programming language' which is a subtype of 'Programming language'. See here for a list of 101wiki’s namespaces.
  • Bits of content management: We expect that the structure of pages can vary, in our case, depending on the namespace (the ‘type’) of page. That is, there are different sections that may be used and each type of section may come with certain expectations regarding its content. For instance, a ‘headline’ is a section that should be used by any 101wiki page while a ‘motivation’ is (currently) only expected by a page for a system 'feature'. See here for a list of 101wiki’s sections.

For instance, here is most of the content of 101wiki's page for the Haskell programming language:

Content for https://101wiki.softlang.org/Language:Haskell

In fact, we show the metadata section of the Haskell page separately:

Metadata for https://101wiki.softlang.org/Language:Haskell

That is, Haskell is also located on haskell.org and Wikipedia. We use 'sameAs' to express that these are all resources describing the Haskell language. There is also an 'instanceOf' property to express that Haskell is a functional programming language. 'Inbound' properties are also shown to help the user realize what other pages relate to Haskell.
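
To make the idea of typed links and 'inbound' properties more concrete, here is a minimal sketch in Haskell. It is an illustration only, not 101wiki's actual data model: the `Triple` type, the helper functions, and the sample page names other than 'Language:Haskell' are assumptions made up for this example.

```haskell
-- Hypothetical model of typed links as subject/property/object triples.
-- This is an illustration only, not 101wiki's actual data model.
data Triple = Triple
  { subject  :: String  -- page the link starts from
  , property :: String  -- link type, e.g. "sameAs" or "instanceOf"
  , object   :: String  -- target page or external URI
  } deriving (Eq, Show)

-- Illustrative metadata; page names and targets are made up for the example.
triples :: [Triple]
triples =
  [ Triple "Language:Haskell" "sameAs" "https://www.haskell.org/"
  , Triple "Language:Haskell" "instanceOf" "Functional programming language"
  , Triple "Contribution:example" "uses" "Language:Haskell"
  ]

-- Outbound links: triples whose subject is the given page.
outbound :: String -> [Triple] -> [Triple]
outbound page = filter ((== page) . subject)

-- Inbound links: triples whose object is the given page.
inbound :: String -> [Triple] -> [Triple]
inbound page = filter ((== page) . object)
```

With these definitions, `inbound "Language:Haskell" triples` yields the single 'uses' link from the hypothetical contribution page, which is the kind of information the 'inbound' part of a metadata section surfaces.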

Semantic wiki self-description

  • Link types are to be declared on the wiki itself: This means, in our case, that there is a type (a ‘namespace’) of properties. It also means that there are ‘meta-properties’ dealing with the properties of properties. That is, each property, just like in the Semantic Web, has a domain and a range.
  • Page types are to be declared on the wiki itself: This means, in our case, that there is a type (a ‘namespace’) of namespaces. It also means that there are ‘meta-properties’ dealing with the properties of (pages as members of) namespaces. That is, each namespace associates with mandatory and optional sections and properties. Accordingly, there is also a type (a ‘namespace’) of sections.
  • Link endpoint types are to be declared on the wiki itself: This means, in our case, that there is a type (a ‘namespace’) of types. There is basically a type for each 101wiki namespace, but there are additional types such as ‘String’ for string-typed properties, ‘URI’ for reaching out of 101wiki, and ‘Any’ to refer to the union of all 101wiki namespaces.

For instance, these are the properties for the namespace of languages:

Metadata for https://101wiki.softlang.org/Namespace:Language

That is, the namespace relates to the concept of 'software language'. Each page in the namespace must have a 'headline' as well as a section with metadata; it may have sections 'details', 'quote', and 'illustration'. The metadata must at least exercise the 'instanceOf' property for classification. The 'exemplifiedBy' property at the bottom of the figure is a bit special; we discuss it just below.
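
The rule just described (mandatory and optional sections, plus mandatory properties per namespace) could be checked along the following lines. This is a hedged sketch, not the actual 101wiki checker: the types `NamespaceSpec` and `Page`, the section name "metadata", and the decision to report sections outside the spec as violations are all assumptions for illustration.

```haskell
-- Hypothetical sketch of a namespace 'metamodel' check; not 101wiki's code.
data NamespaceSpec = NamespaceSpec
  { mandatorySections   :: [String]
  , optionalSections    :: [String]
  , mandatoryProperties :: [String]
  }

data Page = Page
  { pageSections   :: [String]  -- names of sections present on the page
  , pageProperties :: [String]  -- names of properties used in the metadata
  }

-- Spec for the 'Language' namespace, following the description in the text.
-- The section name "metadata" is an assumption.
languageSpec :: NamespaceSpec
languageSpec = NamespaceSpec
  { mandatorySections   = ["headline", "metadata"]
  , optionalSections    = ["details", "quote", "illustration"]
  , mandatoryProperties = ["instanceOf"]
  }

-- Collect human-readable violations; an empty list means the page conforms.
-- Treating sections outside the spec as violations is an assumption.
check :: NamespaceSpec -> Page -> [String]
check spec page =
     [ "missing section: " ++ s
     | s <- mandatorySections spec, s `notElem` pageSections page ]
  ++ [ "unexpected section: " ++ s
     | s <- pageSections page
     , s `notElem` (mandatorySections spec ++ optionalSections spec) ]
  ++ [ "missing property: " ++ p
     | p <- mandatoryProperties spec, p `notElem` pageProperties page ]
```

A page with sections 'headline', 'metadata', and 'quote' that uses 'instanceOf' passes the check; a page with only a 'headline' and no properties is reported as lacking the metadata section and the 'instanceOf' property.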

Semantic wiki quality monitoring

Given how much messy content there is on 101wiki, given how difficult it still is to agree on semantics of page and link types, we are starting to use one magic property, ‘exemplifiedBy’, to designate 101wiki pages that are reasonably representative of a type (a namespace, a property, a section, etc.). This helps the team to consult these exemplars in trying to migrate more legacy to an emerging 'metamodel'. The metadata for the property is mind-boggling.

Metadata for https://101wiki.softlang.org/Property:exemplifiedBy

That is:

  • The page describing the property is linked to the notion of Exemplar.
  • Subjects of the property may be a namespace, section, or property page. That is, these kinds of pages can be 'exemplified'.
  • Objects of the property may be pages in 'any' namespace. This is a bit weakly typed because we expect, of course, that an exemplar for a namespace should be a page in that namespace. (So basically 101wiki's type system is not powerful enough to capture all details.)
  • It so happens that the property page for 'exemplifiedBy' itself is a feature page for the property; see 'this exemplifiedBy this'.
  • We also see how the use of the property is documented in the 'metamodel' of the namespaces namespace, section, and property. 
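
The gap between the declared typing (range 'Any') and the intended constraint (an exemplar of a namespace lives in that namespace) can be sketched as two separate checks. This is only an illustration under assumptions: the function names, the "Namespace:Name" page-name convention for all three kinds of subjects, and the namespace name "Section" are hypothetical, not taken from 101wiki's implementation.

```haskell
-- Split a page name such as "Namespace:Language" into ("Namespace", "Language").
-- Names without a colon get an empty namespace.
splitPage :: String -> (String, String)
splitPage name = case break (== ':') name of
  (ns, ':' : rest) -> (ns, rest)
  (rest, _)        -> ("", rest)

-- Domain check that a plain domain/range declaration can express:
-- the subject must be a namespace, section, or property page.
-- The namespace names used here are assumptions.
domainOk :: String -> Bool
domainOk subj = fst (splitPage subj) `elem` ["Namespace", "Section", "Property"]

-- Extra check that a range of 'Any' cannot express: an exemplar of a
-- namespace must itself be a page in that namespace. Other kinds of
-- subjects are only subject to the domain check in this sketch.
exemplarOk :: String -> String -> Bool
exemplarOk subj obj = case splitPage subj of
  ("Namespace", ns) -> fst (splitPage obj) == ns
  _                 -> domainOk subj
```

For example, `exemplarOk "Namespace:Language" "Language:Haskell"` holds, whereas an object from a different namespace would be rejected even though it is well typed with respect to 'Any'.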


I take responsibility for the content mess on 101wiki, but I would like to acknowledge some people who have contributed or are contributing to 101 in a significant way, despite my epic failure. Hopefully this acknowledgment will not be used against them :-)

  • Andrei Varanovich (former developer and content author)
  • Thomas Schmorleiz (former developer)
  • Kevin Klein (the incredible current developer)
  • Marcel Heinz (current content author and ontologist)
  • Johannes Härtel (current content author and data miner)
  • Hakan Aksu (current content author and educator)
  • Wojciech Kwasnik (the team's logo artist acknowledged here)

The logo of '101': it hints at the Tower of Babel and how the project hopefully illuminates the knowledge area of software languages, technologies, and concepts on the grounds of an advanced chrestomathy approach.



Peano goes Maybe

Just for the fun of it, let's represent Nats as Maybies in Haskell.

import Prelude hiding (succ)
-- A strange representation of Nats
newtype Nat = Nat { getNat :: Maybe Nat }
-- Peano zero
zero :: Nat
zero = Nat Nothing
-- Peano successor
succ :: Nat -> Nat
succ = Nat . Just
-- Primitive recursion for addition
add :: Nat -> Nat -> Nat
add x = maybe x (succ . add x) . getNat
-- Convert primitive Int into strange Nat
fromInt :: Int -> Nat
fromInt 0 = Nat Nothing
fromInt x = succ (fromInt (x-1))
-- Convert strange Nat into primitive Int
toInt :: Nat -> Int
toInt = maybe 0 ((+1) . toInt) . getNat
-- Let's test; this prints 42
main :: IO ()
main = print $ toInt (add (fromInt 20) (fromInt 22))

I wrote this code in response to a student question, whether and, if so, how one could code recursive functions on maybies. This inspired me towards the exam question as to how the above code compares to more straightforward code which would use an algebraic datatype with Zero and Succ constructors instead of maybies.
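
For comparison, the more straightforward variant might look as follows. This is one possible rendering, kept deliberately parallel to the code above, and not necessarily the intended exam answer; it is meant as a separate module, so the names `Nat`, `add`, `fromInt`, and `toInt` are reused.

```haskell
-- Conventional Peano naturals as an algebraic datatype
data Nat = Zero | Succ Nat

-- Primitive recursion for addition (recursion on the second argument)
add :: Nat -> Nat -> Nat
add x Zero     = x
add x (Succ y) = Succ (add x y)

-- Convert primitive Int into Nat (like the Maybe-based version,
-- this does not terminate for negative input)
fromInt :: Int -> Nat
fromInt 0 = Zero
fromInt x = Succ (fromInt (x - 1))

-- Convert Nat into primitive Int
toInt :: Nat -> Int
toInt Zero     = 0
toInt (Succ x) = 1 + toInt x
```

The two representations are isomorphic: Nothing plays the role of Zero and Just plays the role of Succ. The algebraic datatype uses pattern matching where the Maybe-based version uses the 'maybe' fold, so `toInt (add (fromInt 20) (fromInt 22))` again evaluates to 42.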


An ontological approach to technology documentation

SE talk at Chalmers, Gothenburg, Sweden

An ontological approach to technology documentation

Room 473 / Wed March 1 - 11:00 - 12:00 

Speaker: Ralf Lämmel, University of Koblenz-Landau

Abstract: In this talk, I am going to present an ontological approach to software technology documentation. That is, usage scenarios of a technology (such as an object/relational mapper, a web-application framework, or a model transformation) are captured in terms of the involved entities (e.g., artifacts, languages, abstract processes, programming paradigms, functions, and function applications) and the relationships between them (e.g., membership, conformance, transformation, usage, and reference). I am going to discuss language and tool support for and experiences with developing such technology documentation. In the SoftLang team at Koblenz, we work on the related but broader notion of "linguistic software architecture" or "megamodeling". I will briefly discuss applications of megamodeling other than technology documentation, namely build management and regression testing. More information: http://www.softlang.org/mega

Slides in preparation


The Haskell Road to Software Language Engineering and Metaprogramming

FP talk at Chalmers, Gothenburg, Sweden

The Haskell Road to Software Language Engineering and Metaprogramming

2017-02-24, 10.00, conference room 8103, Rännvägen 6, Johanneberg.  

Speaker: Ralf Lämmel, University of Koblenz-Landau

In this talk, I would like to sketch my upcoming textbook on software languages http://www.softlang.org/book while putting on the hat of a Haskell programmer. Overall, the book addresses many issues of software language engineering and metaprogramming: internal and external DSLs, object-program representation, parsing, template processing, pretty printing, interpretation, compilation, type checking, software analysis, software transformation, rewriting, attribute grammars, partial evaluation, program generation, abstract interpretation, concrete object syntax, and a few more. Haskell plays a major role in the book in that Haskell is used for the implementation of all kinds of language processors, even though some other programming languages (Python, Java, and Prolog) and domain-specific languages (e.g., for syntax definition) are leveraged as well. I hope the talk will be interactive and help me to finish the material and possibly give the audience some ideas about FP- and software language-related education and knowledge management.



Software Language Book ready for review and limited access

I am happy to be done with the draft of the software language book. Just sent it off to Springer for the final verdict/review.

Please find the book's frontmatter (including table of contents, preface, and acknowledgment) as well as the first technical chapter ("The notion of software language") online:


If you would like to review the draft book or use it already in the classroom, please get in touch. The draft has been sent to Springer and I hope to receive Springer's Ok+input within three months and finalize the book accordingly no later than May 2017. While the book is under review and further scrutiny, I am going to perform self-motivated proofreading and fine-tuning. (There are some obvious dimensions for the final mile: index, exercises, English, formatting, clarity, bibliography.)

I am going to have a sabbatical from mid-February to the end of October 2017. I am super-über-motivated to visit a few places, give guest lectures drawn from the book, discuss use of the book in teaching, and, of course, engage in research along the way. No matter what, there is going to be extensive slide, video, and code material complementing the book by the end of summer.

All the code that's in the book and a lot more is available online anyway:


BTW, the book's repo is megamodel-managed; see here:

Happy New Year

PS: Just received word from Springer regarding schedule:
  • 1-15 January: Identification of reviewers :-)
  • 15 January - 7 April: Reviewing
  • 7 April - 15 May: Revision
  • 15 May - 1 July: Copy editing
  • 1 July - 1 October: Production
  • 23 - 24 October: Outing at SLE 2017 conference