In my field many scientists tend not to publish their code or their data. They mostly add a note that code and data are available upon request.
I can see the pros of publishing the code: it's obviously better for open science, and it makes the manuscript more solid and easier for anyone trying to replicate the work.
But on the other hand, it's substantially more work to clean and organize the code for publication, and it increases the surface for nitpicking and criticism (e.g. of coding style). Besides, many scientists treat code as a competitive advantage, so publishing the code means giving that advantage up.
Matt Might has a solution for this that I love: Don't clean & organize! Release it under the CRAPL[0], making explicit what everyone understands, viz.:
"Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"
Benefits: people who want to reproduce your analysis can use exactly the right software, and people who want to build on your work can find the latest version in your repo. Either way, they know how to cite your work correctly.
In practice drive-by nitpicking over coding style is not that common, particularly in (some) science fields where the other coders are all other scientists who don’t have strong views on it. Nitpicks can be easily ignored anyway.
BTW should you choose to publish, the Turing Way has a section on software licenses written for researchers: https://the-turing-way.netlify.app/reproducible-research/lic...
* You increase the impact of your work and as a consequence also might get more citations.
* It's the right thing to do for open and reproducible research.
* You can get feedback and improve the method.
* You are still the expert on your own code. It's unlikely that someone will pick it up, implement an idea you also had, and publish before you.
* I have never gotten comments like "you could organize the code better", and I don't think researchers tend to make them.
* Via the code you can get connected to groups you haven't worked with yet.
* It's great for your CV. Companies love applicants with open-source code.
No, this is almost never the case. It should be. But it cannot really be. There are always more details in the code than in the paper.
Note that even the code itself might not be enough to reproduce the results. Many other things can matter, like the environment, software or library versions, the hardware, etc. Ideally you should also publish log files with all such information so people could try to use at least the same software and library versions.
And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).
Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).
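A minimal sketch of that kind of seeding and environment logging in Python (the seed value and the logged fields are arbitrary examples, not from any particular paper):

```python
import platform
import random
import sys

SEED = 1234  # arbitrary; record it alongside your results


def set_seeds(seed=SEED):
    """Seed every RNG your pipeline uses; add numpy/torch seeds if you use them."""
    random.seed(seed)


def log_environment():
    """Record the details a replicator would need to rebuild your environment."""
    print(f"python: {sys.version.split()[0]}")
    print(f"platform: {platform.platform()}")
    print(f"seed: {SEED}")


# Re-running with the same seed must give the same numbers.
set_seeds()
run_a = [random.random() for _ in range(3)]
set_seeds()
run_b = [random.random() for _ in range(3)]
assert run_a == run_b  # deterministic given the same seed
log_environment()
```

The same idea extends to dumping `pip freeze` output (or your package manager's equivalent) into the log so the exact library versions travel with the results.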
> In my field many scientists tend to not publish the code nor the data.
This is bad. But it is no reason for you to follow the same practice.
> clean and organize the code for publishing
This does not make sense. You should publish exactly the code as you used it, not a restructured or cleaned-up version. It should not be changed in any way. Otherwise you would also need to redo all your experiments to verify it still does the same thing.
OK, if you did that as well, then fine. But this extra effort is really not needed. Sure, it is nicer for others, but your hacky, crappy code is still infinitely better than no code at all.
> it will increase the surface for nitpicking and criticism
If there is no code at all, this is a much bigger criticism.
> publishing the code will be removing the competitive advantage
This is a strange take. Science is not about competing against other scientists. Science is about working together with other scientists to advance the state of the art. You should do everything to accelerate the process of advancement, not try to slow it down. If such behavior is common in your field of work, I would seriously consider changing fields.
My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.
You'll open yourself up for comments. They may be positive or negative. You'll only know how it pans out afterwards.
Is the code something that you'll want to improve on for further research? If so, publish it on GitHub. It opens the way for others to contribute and improve the code. Be sure to include a short README saying that you welcome PRs for code cleanup, etc. That way you can turn comments criticizing your code into a request for collaboration. It really separates helpful people from drive-by commenters.
On the other hand, if your goal is only to advance your own career and you want to inhibit others from operating in this space any more than necessary to publish (diminish your “competitive advantage”) then I guess you wouldn’t want to publish.
Anyone who programs publicly (via streaming, blogging, open source) opens themselves up for criticism, and 90% of the time the criticism is extremely helpful (and the more brutally honest, the better).
I recall an Economist magazine author made their code public, and the top comments on here were about how awful the formatting was. The criticism wasn't unwarranted, and although harsh, would have helped the author improve. What wasn't stated in the comments is that by publishing their code, the author already placed themselves ahead of 95% of people in their position who wouldn't have had the courage to do so. In the long run, the author will get a lot better and much more confident (since they are at least more aware of any weaknesses).
I'd weigh up the benefits of constructive (and possibly a little unconstructive) criticism and the resulting steepening of your trajectory against whatever downsides you expect from giving away some of your competitive advantage.
I've always published my research code. Thanks to that, one of the tools I wrote during my PhD has been re-used by other researchers and we ended up writing a paper together! In my field it was quite a nice achievement to have a published paper without my advisor as a co-author even before my PhD defense (and it most likely counted a lot toward my getting a tenured position shortly after).
The tool in question was finja, an automatic verifier/prover in OCaml for counter-measures against fault-injection attacks on asymmetric cryptosystems: https://pablo.rauzy.name/sensi/finja.html
My two most recently published papers also come with published code released as Python packages:
- SeseLab, which is a software platform for teaching physical attacks (the paper and the accompanying lab sheets are in French, sorry): https://pypi.org/project/seselab/
- THC (trustable homomorphic computation), which is a generic implementation of the modular extension scheme, a simple arithmetical idea that lets you verify the integrity of a delegated computation, including over homomorphically encrypted data: https://pypi.org/project/thc/
As the other comment said, if you care about "advancing the science", and won't mind stuff like the above happening, then go for it. In my experience, it is not worth it.
* I have never had someone come back to criticize my code style. And if they do, so what? I'll block them and not think about it again. I don't need to get my feathers ruffled over this.
* Similarly, if someone's trying to replicate my results, and they fail, it's on them to contact me for help. After that it's on me to choose how much effort to put into helping them. But if they don't contact me, or if they don't put in a good faith effort to replicate the results, that's their problem. If they try to publish a failure to replicate without having done that, it's no more valid science than publishing bad science in the first place.
Overall, I think most people who stress about publishing code do so because they haven't done it before. I've personally only ever had good consequences from having done so (people picking up the code who would never have done anything with it if it weren't already open source).
No, it isn't.
Reproducing the results means that you provide the code that you used so that people can reproduce it just by running "make" (or something similar). If you do not publish the code and the input data, your research is not reproducible and it should not be accepted in a modern, decent world.
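As a hedged illustration of the "just run `make`" standard, a minimal Makefile is often all it takes (every file name below is a placeholder, not from any particular paper):

```makefile
# All paths here are hypothetical; adapt them to your own scripts and data.
all: results/figure1.pdf

# Rebuild the paper's figure from the raw input whenever either changes.
results/figure1.pdf: analysis.py data/input.csv
	python3 analysis.py data/input.csv --out results/figure1.pdf

clean:
	rm -rf results/

.PHONY: all clean
```

The point is not the tool; any single entry point that reruns the full analysis from raw data to final figures serves the same purpose.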
It doesn't matter that your code is ugly. Nobody is going to look at it anyway. They are only going to call it. If the code is able to produce the results of the paper with the same input data, that's enough. If the code is not able to at least do that, this means that even you are not able to reproduce your own results. In that case, you shouldn't publish the paper yet.
This was kind of a change for my advisor, who was definitely less interested in that aspect of research. I think this is an issue in academia and needs to change.
Also, ultimately if someone wants to copy and publish your work as their own it will be relatively easy to show that and the community as a whole will recognize it.
Also, for me it felt good when another student/researcher was aided by my work.
https://shankarkulumani.com/publications.html
You don't need to clean it up or make the code presentable. Everyone knows it's research grade code. Most important part is that you have the code in a state that you can reuse in the future for another publication.
I've been saved multiple times by being able to easily go back to decade old work and reproduce plots.
I hypothesize that you will see some combination of three effects: (1) you will get lots of downloads (which means people are using your code, good work!), perhaps with lots of follow-up emails and perhaps not depending on what the code does; (2) you will get lots of emails from random nutjobs looking to pick holes in your work, and you will waste your time answering them; (3) you will get almost completely ignored.
Whatever the outcome, I think a lot of people would be interested in hearing about what you learn.
https://www.usenix.org/conference/usenixsecurity22/call-for-...
I'll add: I think that we need to change the mindset in academia about code. If code was involved in producing the results in the paper that code should be considered part of the paper and (at least) as important as the text of the paper. (Same for data)
The mathematicians and computer scientists I've worked with generally wrote more complicated code, but from a bugginess and maintainability standpoint I'm not sure it was any better. I had a mentor with an applied math degree who was extremely fond of one and two character variable names.
Just publish it. Unless your paper is a _BIG_DEAL_ barely anybody is going to look at it, and some people (hopefully the right people) will respect you for showing your work. I think I'm one of the few reviewers that actually try to run and maybe glance at the code for papers I review. In the papers I've reviewed I've never seen a comment that indicated any of the other reviewers even looked at it.
In a very real sense, unless a paper has a result so compelling I can't ignore it, if there's no published source code -- even an obvious prototype! -- I'll pass it by. I'm not alone in that in my line of work. Industry folks might also be more willing to accept prototype code than academic folks, I dunno.
Worth considering, I guess, if you're interested in your work crossing the academic/industry boundary smoothly.
Publishing research code is admirable, and in an ideal world everybody would publish their code and data. That said, we shouldn't pretend that there aren't tradeoffs. Time spent polishing your code to make it presentable is time not spent on other aspects of your research. Time spent developing software development skills is time that could be spent learning new research techniques (or whatever). Reproducible research is great, but it's certainly possible to take it too far at the expense of your productivity/career.
You should also take your own personality into account. If you're a perfectionist you might struggle to let yourself publish research-quality code rather than production-quality code and consequently over-allocate the time you spend prettying up your code.
BUT, I have definitely encountered the situation where I read a paper, then looked at the associated code, and found that the exciting result was entirely because of a bug. The reputation, "This investigator is someone who does shoddy, error-prone work" is probably the worst possible one.
As an example, I found a paper that promises a method to do the very thing I want to accomplish. It's not too dense, but it skips a few crucial steps, and I've been working on coding the method for a year now (on and off, of course, but still a long time). If the code were available, it probably wouldn't have taken as long. The paper didn't mention that the code was available upon request, but it was implemented in a piece of software. I eventually found that, but it was a version from just before the feature I'm after was added. I tracked down the author, and they were a great sport about cold emails but didn't have the source any more.
So yes, please publish the code. You don't have to clean it up. It worked for the paper — it's good enough. Even the most terrible code is immeasurably better than no code.
Reproducibility -- I dunno. A re-implementation seems better for reproducibility. The paper is the specific set of claims, not the code. If there are built-in assumptions in your code (or even subtle bugs that somehow make it 'work' better), then someone who "reproduces" by just running your code will also inherit those assumptions.
Coding time -- are you sure? Professional coders are pretty good. If you have, for example, taken the true academic path and written your code in FORTRAN, there's every chance that a professional could bang out a proof of concept in Python or C++ in about a week (it really depends on the type of code -- Eigen and NumPy save you from a whole layer of tedium that BLAS and LAPACK 'helpfully' provide). Really good pseudocode might be more useful than your actual code.
Another note -- personally I treat my code as essentially the IP of my advisor. (He eventually open-sources most things anyway.) But do check on the IP situation if you want to open-source it yourself. If you are working as a research assistant, some or all of your code may belong to your university. They probably don't care, but it is better to have the conversation before angering them.
You're supposed to welcome criticism and 'nitpicking' as a scientist.
1. It gives your work more visibility. If there is an easy git-clone route to reproducing your work, it offers a low-effort starting point for people to build upon your work, which means they are more likely to use it. Plus you get free citations from anyone who touches it.
2. There is no reason people should be hoarding code in academia, and the only reason people do it now is a sort of prisoner's dilemma problem (the first person to publish their code had to start from scratch, so they feel possessive and let it die when they graduate). Every researcher who releases their code chips away at the problem and pushes the community to be more open with its code, which is intrinsically more efficient.
3. If you get lucky and the community adopts your code, it will be viewed very positively by future career-advancement committees: you'll be "the guy who wrote x".
4. When I started in academia I based my codebase on an existing publicly available code, which saved me a huge amount of time in my work. I built upon it (not expanding the base code, but using it as a module to integrate experimental measurements to the simulations tools I wrote from scratch) in my PhD and when I graduated I handed a virtualbox image with the whole mess (yay free code--wouldn't have been possible with nonfree code) off to my successors, people in new groups, etc which ended up being the base of an entire new research group at a different university. Every once in a while I get an email asking for help, and get a notification saying that someone cited the code.
Personally, I would. Open source is a form of peer review, and if you want to stand by your paper as peer-reviewable, then I believe the code should be included in that. More researchers need to open up their code to peer review, because research code tends not to have the same robustness against mistakes (through coding conventions as well as tests) as professional software development. I shudder to think how many papers have flawed results that no one realises and that are just accepted, because no one can spare the effort of rebuilding the code from scratch, without any prior reference, in order to verify said results.
I don't think you need to clean it up. You're not entering a coding-elegance competition; rather, you're allowing someone to find bugs if they exist and point them out, just as they would when peer reviewing your paper.
More cynically, spaghetti code probably helps as a defense against people ripping off your code, so if you're worried about your competitive advantage then not cleaning it up is a form of security through obscurity :)
Separate from that, is there fairly new chatter in your field about reproducible science, publishing code and data, etc.? If so, what's the current thinking there about how valuable this is to collective science, and how that should affect the sometimes unfortunate conflicts of interest between career and science?
But it is more honest. Whatever you think about the effort required to do this, there's value in honesty.
Here is an example of my own scientific work:
- paper [0]
- preprint [1]
- GitHub [2]
It certainly wasn't easy to get all of this done. But doing this can also be a guide for others. They get to see exactly what you've done so that they don't waste months on the exact implementation. They can see where maybe you've made some mistakes to avoid them. They can see so much of the implicit knowledge that is left out of your paper and learn from it. Your code isn't going to be perfect, but what paper is, either?
Everyone will be a critic, anyway, so make it easy to pick up criticism of the stuff you feel the least confident in and do better next time. You won't get better if no one sees your code.
[0]: https://cancerres.aacrjournals.org/content/81/23/5833
[1]: https://www.biorxiv.org/content/10.1101/2021.01.05.425333v2
[2]: https://github.com/LupienLab/3d-reorganization-prostate-canc...
At the time there were two widely used software packages for phylogenetic inference, PAUP* [2] and MrBayes [3]. The source code for MrBayes was available, and although at the time I had some pretty strong criticisms of the code structure, it was immensely valuable to my research, and I remain very grateful to its author for sharing the code. In contrast the PAUP* source was not available, and I struggled immensely to replicate some of its algorithms. As a case in point, I needed to compute the natural log of the gamma function with similar precision, but there was no documentation for how PAUP* did this. I eventually discovered that the PAUP* author had shared some of the low-level code with another project. Based on comments in that code I pulled the original references from the 60s literature and solved these problems that had plagued me for months in a matter of days. Now, from what I could see in that shared PAUP* code, I suspect that the PAUP* code is of very high quality. But the author significantly reduced his scientific impact by keeping the source to himself.
[1]: https://github.com/canonware/crux
Now both of the researchers have to be cited, but only one of them did the discovery work.
From Heil et al. (https://www.nature.com/articles/s41592-021-01256-7):
> Documenting implementation details without making data, models and code publicly available and usable by other scientists does little to help future scientists attempting the same analyses and less to uncover biases. Authors can only report on biases they already know about, and without the data, models and code, other scientists will be unable to discover issues post hoc.
Even better would be to containerize all software dependencies and orchestrate the analysis with a workflow manager. The authors of the above paper refer to that as "gold standard reproducibility".
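As a hedged sketch, a container recipe for a Python analysis can be as small as this (the base image tag, file names, and entry point are all placeholders, not taken from the paper above):

```dockerfile
# Placeholder example: pin the interpreter and every library version.
FROM python:3.10-slim
WORKDIR /analysis
COPY requirements.txt .
# requirements.txt should pin exact versions, e.g. numpy==1.24.4
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "run_analysis.py"]
```

With something like this, "works on my machine" becomes "works in this image", which is most of what a replicator needs.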
You have limited time. I'd prioritize that time on what you think others will find useful.
Don't worry about ugly code. There are research codes with 1k+ stars on GitHub that are ugly. They have so many stars because people find them useful.
You absolutely don't have to publish your code, or anything else for that matter. Don't let the drive for impact on the community force you into working on something you're not interested in.
Congrats on your publication.
Research based on or involving code/models/algorithms should always be accompanied by a code drop. Nobody expects the code to be of good quality.
Everything else is not reproducible - and against the scientific codex (IMO).
I read so many papers that claim incredible results, wondering how they implemented their models in this particular simulator (close to impossible with only what is out there), only to find that there is just nothing to be found, anywhere. No repo, no models, no patch. NIL.
Sending an E-Mail? No response.
Further, anyone could just claim anything this way. Why bother doing any real work?
What if there is a small error in the code?
Wouldn't it be better to know that? In a scientific sense, searching for "the truth"?
If it's uncommon to release code then I'd doubt anyone in the peer review will review it.
It's better than nothing, it also is the only way for others to reproduce your results. I am surprised you were not asked to do that by whatever journal you chose to publish your results.
>many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage
LOL, what!? What is this crap about "competitive advantage"? Are you privately funded? Then it's fine. If you're funded by public (i.e. government) money, you are (at least ethically) obliged to share your work with everybody.
Computer engineering on novel systems is a bit harder, but a /complete/ spec of the system (enough for someone to precisely rebuild it) should be published in that case. Remote access on request to the prototype would be better.
Regardless of whether or not you release the code, you should do this.
It’s so common for people to think that cleaning/refactoring/documenting code is a waste of time, but it’s exactly the opposite.
The point at which the code is working, but not yet polished is exactly the prime “teachable moment” for improving your skills as a programmer and for refining your knowledge of the domain the program solves for. (This is true no matter how skilled or knowledgeable you already are).
Your brain is perfectly primed to do this now, so don’t let that go to waste.
Some papers link to the code instead of including it. Maybe I'm just unlucky, but this usually leads to dead links (but that's a different topic altogether).
There is a nice Perspective piece in Science from 2011 [1] touching on the question of cleaning up the code. It suggests basically the same thing as several of the comments in this thread: if you don't have time or motivation to clean up the code, don't.
"even incremental steps would be a vast improvement over the current situation. To this end, I propose the following steps (in order of increasing impact and cost) that individuals and the scientific community can take. First, anyone doing any computing in their research should publish their code. It does not have to be clean or beautiful (13), it just needs to be available. Even without the corresponding data, code can be very informative and can be used to check for problems as well as quickly translate ideas. ... The next step would be to publish a cleaned-up version of the code along with the data sets in a durable non-proprietary format."
[1] Peng (2011) Science 334 1126-1127 https://doi.org/10.1126/science.1213847
There is value in scrutinizing the code - not w.r.t. coding styles or standards, but to discover bugs in the implementation, which are very common. Scientists are only human, and scientific software is less often checked by a second pair of eyes. There is also value in trying to replicate a study from scratch with a fresh implementation based only on the details in the paper. Many conferences, for instance the European Conference on Information Retrieval (ECIR), Europe's largest scientific search technology conference, have a replication track just for replication papers, and these are often the most interesting/insightful papers. It occasionally happens that a result is not caused by what the authors think, but is merely an artifact of the implementation code. A very famous MIT researcher (not naming him or her here on purpose) fell into this trap in their Ph.D. thesis, but it can happen to anyone, really. Scientific results become objective knowledge as others solidify the body of knowledge by carrying out replications and arriving at the same results.
Whatever your decision about past code, going forward, if you plan to release all future research code, you will likely write better code in the first place, as you will constantly be aware that people will be looking at it, and that can only be a good thing.
In the field I follow the most (Computer Graphics/Rendering) I think there is a big problem with reproducibility as well, and to be honest, I think some of the major players actually have little interest in making this significantly better, since they can take advantage of the visibility of a flashy render/fps counter shown at an event while still building a "moat" between themselves and others who want to adopt the same methods.
Which is, in the end, partly an answer to your question: your paper could clearly describe all the elements needed to implement a method correctly, but by providing a sample implementation you allow others to "stand on your shoulders", as they say, instead of having to climb there first and then proceed. You needn't worry too much about the state of your codebase: make it clear via the README/documentation/license that it's still in the "proof of concept" phase.
One reasonable observation I have heard is that in some fields, during peer review, some reviewers seem to like to nitpick on the code rather than the paper, sometimes in subtle ways. Because of that, I think it can be (unfortunately) OK to release the code after acceptance or publication. But apart from this, I see only advantages.
(FWIW, I'm a professor at an R1 university. I give this advice to all of my Ph.D. students and strongly, strongly encourage them to put their code out there on our github.)
It is better for science, it will be better for you and it will be better for people who want to play with your code.
Publishing is a form of advertising what you did, and helping others reproduce it makes it go viral and is a testament to how much you care. It can only help your career.
You’ll definitely get people who nitpick the code. This won’t hurt and it may even help in its own way.
As an outsider looking in, many academic fields seem to have a reproducibility crisis. Many psychological studies, for example, cannot be reproduced yet they continue to be cited.
I personally feel like every academic paper should be reproducible. I should be able to rerun the study and get the same results. Obviously clinical trials may vary (hence the importance of statistical significance), but the real problem is data and models. If I, as someone reading your study, don't have your data, how can it possibly be reproduced? If I gather my own data, will I get completely different results? If I'm relying solely on the details you give, how do I know you haven't made a fatal assumption, or that your model's code isn't simply buggy?
I personally feel like a condition of all Federal funding should be that the data and any code should be made freely available.
So I support the idea of releasing it and that releasing something messy is better than releasing nothing but I can't speak to your individual circumstances.
I left the academic world a few years ago, but several of the analysis codes/models I published (either as stand-alone tools or artifacts published alongside a journal article) still regularly get used... if anything, there's probably a larger user base for one of my models today than there ever has been, and it's leading to a long-tail of publications where my initial work is either cited or I'm offered co-authorship when I have time to offer hands-on support for improving the model/code and offering my insight as a domain expert.
If you can take the time to clean up some code or author a lightweight package, that's amazing! But it's a bang-for-your-buck type thing. If you ever aspire to leave academia, it's undoubtedly worth spending some time to clean up the code, add documentation, add some unit tests, etc. - great artifacts to use in supporting a hiring process if you move into a technical role somewhere in industry. But it's far from necessary.
You can embed this in the PDF; e.g., see section A.1 [1] for how.
[1]: https://raw.githubusercontent.com/motiejus/wm/main/mj-msc-fu...
Then answer any criticism about it by asking for a PR.
To preempt code style complaints find a code formatter for your language and run everything through that first.
Refer to the repository in your paper, but don't put a link. Create a little bit of friction to get to the repo to discourage the casual readers who don't really need the code from popping over too easily.
I've heard this claim so many times, from many an author who had their brain so deep in the problem they were working on that they were 100% incapable of properly gauging the validity of this claim.
To verify that what you claim is true, wait two years to give your brain time to flush the context, pick your research paper up (and nothing else that wasn't made available to others) and try to reproduce the results on a brand new computer, without any of the environment you developed your research with.
See how much blood you end up sweating.
PLEASE publish your research code. Don't worry about it being disgusting and hackish, it's research code, so by definition, no one expects it to be industrial strength.
Don't spend time cleaning it up either, your time is better spent on doing more research.
If you feel responsibility towards the community:
- put a huge disclaimer at the start of the README explaining what a mess the whole thing is *because* it's research code.
- if you really must: list requirements and provide a build.sh
For the journal I edit, authors are required to include the code and data with the submission. The code and data are available along with the paper if it's published. We do replication audits of some papers to make sure you can take the materials they've included and reproduce every result in the paper. If not, the conditional acceptance changes to rejection. I've had cases where reviewers found errors in the code, so I rejected the paper.
On the argument that it's a competitive advantage: what does that mean? You should be able to claim results but not show where they came from? That's not science.
Keep in mind that this is a "source available" requirement, not an open source requirement. It is a matter of transparency. You have to let others see exactly what you did.
What I would not expect from people is code that would necessarily run in your environment. For example, in many cases, the paths are going to be hard-coded, for a variety of reasons. It might be ideal to write code that will just work, in a reproducible environment, but that often takes more work than people are willing to commit to, given all the other things they have to do.
Finally, cleaning up your code for presentation is a final opportunity for you to discover any mistakes before you publish and then later have an embarrassing public retraction.
You could add a disclaimer that the code was worked on until it provided a satisfactory result, and no further, and is not intended for (any) use. You might even add that, except for outright, actual errors that affect the result of the research, comments are discouraged.
I often publish very bad code, terrible terrible spaghetti. It's not how I write code at my job, because at my job I'm paid to produce not only working and correct code, but also code that is maintainable and understandable and follows certain practices.
However, my hobby is not writing corporate code, but writing code that gets done what I want to get done, nothing more, and sometimes less. It might even have actual bugs in it that I can plainly see and don't care about, because they don't affect my uses.
If people can't tell the difference, I don't care, not my problem. If a future employer can't tell the difference, I won't work with them.
I suggest publishing the code as is on something such as Github, Gitlab, etc. I suspect you have ideas on how you can improve the code, perhaps there's even a way of improving your research methodology by doing so, enabling new insights with further research. If you did a follow up experiment with improved analysis enabled by your improved code, then that's another paper, and another (more cleaned up) version of the code to push to the repository.
The above is all supposition though, as I don't know your field. If deep learning then the above seems more likely. If your field is geology, then improvements in the software might not enable better insights.
I'd say a grad student owes nobody anything until they finish, because they're bearing the greatest risk of losing priority, and the openness of science is being used against them. Nothing is lost by waiting until they have their degree in the bag before sharing. Then clean it up and use it as part of your portfolio, or append it to your thesis. Advancing science after you've secured your career is a fair compromise.
I love open source and open science, but also look back on my own graduate studies, and I chose a topic that was protected by virtue of a large capital investment plus domain knowledge that was not represented by code. Also, my thesis predates widespread use of the Internet. ;-)
2. Code IS a competitive advantage. Sometimes you’ll reach out to the author to ask for clarification, and after some back and forth they’ll just suggest you send them the data and put them on the paper, because they don’t really want to disclose the details of the method they’ve previously published.
3. I don’t think you’ll have issues if you share less than perfect code. Most reviewers are as bad at production code as you are.
All in all, I think sharing code advances science. Yes, there’s gatekeeping, tricks to keep the knowledge inside the lab. But didn’t you choose the field because you want to advance the knowledge, help humankind? Making your research more reproducible by sharing the source code is a step in that direction.
The fact that your code is a mess means that it might be buggy; if other people can see your code, someone might find a bug in it. As you said, this is a good thing for open science, and makes your work easier to reproduce.
Nowadays margins are large enough, and publishing costs nothing or next to nothing; you probably don't have any other use for your code, so what would be the advantage for you in not publishing it?
What kind of competitive advantage does it give you? (What many scientists think might be less relevant than what you think about this "competitive advantage" specifically, in your specific case/field.)
About "cleaning it", why?
I mean, if it works as-is (even if it is "ugly"), it still works. What if, in the process of "cleaning it", you manage to introduce a bug of some kind?
Unless you plan to also re-test it after the cleaning, I guess it would be better to not clean it at all.
"Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail."
If it is too much work to refactor the code for publishing, you can also just publish pseudocode.
I don't think anyone will nitpick or criticize coding style or things like that unless it is particularly egregious (e.g. naming variables something vulgar, etc.). The point of research papers is to communicate new and valuable findings. If people in this conference or journal are nitpicking things like that, you may want to find a different place to submit your work.
I don't know what your field is, but in Computer Science I can't say I have ever known people to consider their code a competitive advantage. The only time they might shy from releasing code is when they think they can commercialize it or something.
Ideally, it would be nice if the code has a professional-level quality to it, but I think everyone involved in evaluating research understands that it is at best a prototype. Proper software engineering is expensive, and it is not the role of research to do this. The process, as it was explained to me: university research pushes the state of the art, industrial research labs are slightly behind this and looking to transfer into practical uses (along with this some government agencies are interested in tech transfer) and finally software engineering takes these ideas and turns them into actual products. You aren't making a product, so it is OK for the code not to be perfect (also, from experience, 'professional' industry code is not always that great either). The main point is that someone has some chance of reproducing your results.
The exception to this is if you are making a product, where the definition of product is a tool for further research. Examples might be tools for symbolic execution or formal verification, in which case it might be worth some time to make the experience of using it good for that benefit, to reduce friction so that people try and want to use your tool.
Artefact evaluation is rapidly becoming something people are encouraged to do, and it helps enormously in verifying results; the point is usually to reproduce the results of the paper to back up the science, not to start an argument over coding style. I would hope that artefact evaluation processes make this clear and ensure that evaluations of artefacts focus on reproducibility. For outside comments that might arise, I suggest you publish the work as open source and respond to any criticism with a fairly standard line: yes, this is research-quality code and we would like to have time to improve it; if you would like to submit a patch/pull request we would welcome any help.
If you want real protection of course you can always try to get a patent, but then I've got you because 90% of the people I have this conversation with are worried about people stealing their idea but don't think it is patent-worthy.
A similar analogue exists in startups: ideas are really a dime a dozen. Execution is what matters. There are millions of great startup ideas floating around -- I bet almost anyone could come up with at least a few that are viable -- but actually having the follow-through and dedication to execute that idea, that is what is challenging. I can't tell you how many people I've had calls with where the exchange is basically "I want your thoughts on this amazing idea but you have to sign an NDA first". 90% of the time these people aren't willing to go all-in on their idea and stake their career on it (hence them seeking second opinions), so it makes no sense for them to worry about me "stealing" their half-baked, unrealized idea. I say to them "would you take $3M in interest-free debt to develop this idea right now" and they say "no!" to which I say "then why should I sign an NDA?"
What is useful is if you can produce code people can build on and do their own cool stuff with -- then they will cite you. However, getting something to a state where it is tested for all reasonable inputs, has some basic docs, etc. is a hard undertaking.
https://github.com/minion/minion (C++ constraint solver)
https://github.com/stacs-cp/demystify (Python puzzle solver)
https://github.com/peal/vole (Rust group theory solver)
What can go "wrong"
- Someone may find a minor rounding error, and now you have to issue a correction to the paper which, laudable as it is, is a bad thing.
- You'll end up having to maintain an open-source-something, and possibly forks.
- Your open source code may end up as a GitHub repo in which you are just one of the contributors, not the owner, and others are leeching credit from you.
- People who want to criticise you will find excuses in the coding style.
Research code is messy -- it must be messy imho, or else it's probably insignificant. People who don't publish it are definitely shielded by obscurity, while I have received scrutiny for entirely inconsequential details. You can choose to publish it in a less accessible way, which will thwart people with bad intentions. Even publishing it as a tarball on a web server is enough work to keep them away.
[1]: https://github.com/adewes/superconductor [2]: https://github.com/adewes/pyview https://github.com/adewes/python-qubit-setup
"THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."
So just pick one that is compatible with the 3rd party code you used to write your software (mostly pertaining to copyleft licenses like the GPL) - MIT and BSD licenses are generally "fine" - and just publish it. Just because your code is not "clean" or whatever doesn't preclude it from being free.
Now in the vast majority of cases you will only get a couple of people looking at your code (my experience so far), but still I think it's worth it. The question is, clean up the code or not. Ideally you would, because it increases the chance of someone using it by a lot. On the other hand with the realities of academic work, this is largely underappreciated.
So I recommend finding a balance: clean up enough that it is reasonably straightforward to run the code. Write a good README that points to the paper and gives the appropriate citation.
Science that is not reproducible is not science.
If you can, publish something high-level. Matlab or Python or Julia is fine. C or Java, not so much, because the build environment will not be available any longer after a few years. Actually, if you can, publish several translations.
And don't forget to publish your data sets as well. And your data augmentation or whatever. Everything you need to reproduce your results.
And for the love of Knuth, DO NOT OPTIMIZE YOUR CODE. Dumb code is good code in science. You would not believe what kinds of havoc some algorithms wreaked on my systems in the name of optimization. Optimizations that made a ten-year-old algorithm run in two nanoseconds instead of four (vastly exaggerated). Optimizations that obfuscated otherwise perfectly reasonable algorithms.
The goal is reproducibility.
https://github.com/DarwinAwardWinner/cd4-histone-paper-code
The main points are that I made only a minimal attempt to organize it, and I made the state of the code clear in the README. I don't recall anyone complaining about the code or even mentioning it during review. (Though to be fair, I also don't recall whether I published the code before or after the paper was accepted.)
Looking at things from the other side, I am at least an order of magnitude more likely to read, use the work/methods from, and therefore cite a paper that comes with code.
You should have confidence in the correctness of your code if you are publishing.
If your code is a shitshow, why do you trust it? Decent code is to your own advantage even if no one else ever looks at it.
In the best case, it’s possible to build a community around your code, to wide benefit and your career benefit. I’ve seen this with several peers and students.
As a hiring manager, it’s very nice indeed to read a paper and scan the code of a fresh grad applicant.
My lab’s approach is to put the repo in public and put the hash of the relevant commit in the paper. Then you can keep developing there but readers can be confident they can get the exact code used to justify the claims in the paper.
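That workflow takes only a few git commands. Here is a sketch, demonstrated in a throwaway repository so it is self-contained; in practice you would run the tag and rev-parse steps in your real research repo, and "paper-v1" is a hypothetical tag name:

```shell
set -e
# throwaway repo just for demonstration
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=you@lab.example -c user.name=you \
    commit -q --allow-empty -m "analysis as run for the paper"

HASH=$(git rev-parse HEAD)   # cite this full hash in the paper
git -c user.email=you@lab.example -c user.name=you \
    tag -a paper-v1 -m "exact code used for the published results"

# a reader can later recover exactly this state:
git checkout -q "$HASH"
echo "cite commit: $HASH"
```

The annotated tag is a convenience; the commit hash in the paper is what guarantees readers get the exact code, no matter how the repo evolves afterwards.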
An exception is if you plan to make a company around your IP. You should estimate how likely this is to happen before defaulting to this.
The obvious answer for science is: publish. The goal of science should be to make it easy for others to reproduce your work. Not to make it theoretically possible, but hard, because of the "competitive advantage".
The right thing to do would be to publish, and next time you review another paper that does not publish code, use that as a reason to reject it. The whole "code and data upon request" is obvious bullshit; there have been studies on it, and often enough it ends up with "well, we don't have that code/data any more", "why do you need that? we won't help you if you plan to publish something we don't like", etc.
In your position, I would only release code which is not too hard to reproduce anyway, or which provides only negligible competitive advantage for you. I mainly have "normal" papers in mind (experiments or data analysis); if the main contribution is, for example, an algorithm which you want people to use, then you should obviously publish an implementation.
Every researcher thinks this, and it's always wrong. If you care about scientific progress, publish the code and data.
Besides, available code should cause more people to look at your work and ultimately cite it.
That "competitive advantage" is just holding everyone back, slowing progress. This is particularly annoying to hear coming from "research", which I thought was supposed to be advancing the state of the art for the benefit of society. That's ostensibly the reason for publishing papers, right, to disseminate knowledge? Or is it really just to inflate one's ego and get paid?
Not saying you should publish code, just that deliberately keeping secrets in your field seems to go against what I thought you were doing.
If the purpose is to push human knowledge forward, then it seems backwards not to publish everything.
Personally, I've found it difficult in my various careers to date when I've been put in positions where the actions that serve my immediate interests are in any way in conflict with my underlying principles or overarching goals. It's demotivating and deflating.
If I were in your position, I would publish everything and let myself feel pride in what I did. Even if we're all just insignificant specks in the grand scheme of things, pursuing a greater purpose can help make it feel like something matters.
It should take a couple of hours. The code works? You know how to reproduce what you did, right? It doesn't have to be perfect. It doesn't even have to pass code review. It should just work.
> many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.
Well depends on the field I guess, but you also want recognition and impact. What is the point of publishing a result no one uses?
Unlikely. Following the algorithm from scratch may produce "similar" results, but not "reproduce", bugs and all. The only thing that can do that is your code.
Plus, when you set out to reproduce a paper from only the algorithmic description, it's typically not until you're 2 or 3 weeks into coding that you realise the original paper made many assumptions in the code that were not explicitly stated in the paper.
> However, the implementation can easily take two months of work to get it right.
An even more important reason why you should release your code.
> In my field many scientists tend to not publish the code nor the data.
A regrettable state of affairs indeed.
> They would mostly write a note that code and data are available upon request.
I have personally come across many cases where this promise could no longer be honoured by the time of the request. Publish the code.
> I can see the pros of publishing the code as it's obviously better for open science and it makes the manuscript more solid and easier for anyone trying to replicate the work.
It is also increasingly a requirement of funding bodies.
> But on the other hand it's substantially more work to clean and organize the code for publishing
Then don't. Release it under the CRAPL, stating as much. It is still better than nothing.
> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).
If you were an entrepreneur hoping to peddle snake oil and not get found out, then I would see your point. But you're a scientist, you're supposed to welcome such criticism and opportunities for improvement. If anything, you might even get collaborations / more publications on the basis of improving on that code.
> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.
I would sincerely not feel very comfortable calling such people "scientists".
> But on the other hand it's substantially more work to clean and organize the code for publishing
(B) Don't spend time cleaning code for publishing. Spend your time writing more papers.
> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).
(C) Don't worry about this.
> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.
(D) If you do B, it will also reduce your worries about this. I am half joking.
While any published code receives some nitpicking and bikeshedding, most academic code is terrible so unless you literally use random joke/meme variable names as your only 'documentation' (I wish I were joking) you're not going to look bad to anyone who matters.
If you want to do the world a further favor, get a grad student to read it first and indicate where they cannot follow the code. In my brief stint in academia, I saw very little overlap between brilliant theoreticians coming up with novel approaches and code to support them, and people who knew how to write readable code.
If the answer to the above is no, and it will mostly cost you time and effort, then don't publish.
If the answer to the above is yes, then consider the return on investment for publishing your code. If you earn more reputation/money/whatever by publishing than you spend on the work of publishing, then publish; if not, then don't.
This is probably wrong, depending on the field. At least in machine learning, the papers that get cited the most are those that other people can easily pick up and work on. They become the basis for future work, get cited as baselines more often, etc. Publishing research ML code is a competitive advantage.
Similarly, I’ve found papers that don’t include their complete data set in the paper, and had to try to reverse engineer it from images and so on. It is really frustrating when papers are incomplete.
I wish it wasn't viewed as a competition in the first place.
Do it. There's no good reason not to.
You might be doing a young student a solid :D And don’t worry about cleaning it up!
If you use GitHub you could even disable Issues and have a note saying you don’t accept pull requests (in case you’re worried about support burden).
https://alltamedia.com/2014/04/14/how-to-make-a-link-or-butt...
- People who use your work will cite you.
- You may get collaborators.
- It's an easy-to-get-to backup
- For non-academic jobs, it's part of your resume
You shouldn't even be able to ask this question. The journal should have required you to first or along with the paper publish the code.
Unfortunately, the number of journals that do this is still small, and even the ones that do are sometimes satisfied with a "Code can be obtained upon request".
So, yes. Please publish the code, it will make the rest of the paper stronger.
Personally, I hate it when academics do not publish their code. Some academics publish the code but not the pretrained model, or withhold the dataset, leaving it to collect dust on their computer.
People who publish code, datasets and models become the core building blocks of future work. People who don't fade away; no one remembers their names.
The negatives are overestimated; it is unlikely that many people will read the code.
If the paper is enough to reproduce the results AND cleaning up the code can/is tedious, then adding the "code and data are available upon request" note seems both fair and justified.
That way, whoever wants the code can still ask for it and it does not lay an unnecessary burden on the author.
While I appreciate this is true, it’s also quite sad. Science shouldn’t be a competitive sport to increase a couple metrics like publications and citations such that useful parts of replicating and extending studies aren’t shared. :(
One of my most cited papers is a relatively uninteresting one we wrote for a conference competition. But we have code so it is easy to compare your alternative approach to us. That means citations.
So it can work for your benefit as well.
There have been times when I've had to abandon incorporating an idea presented in a research paper because the paper doesn't have enough information for me to implement it in code. I could've made a lot of progress with some proof of concept code, even if it wasn't clean.
If you published a paper that uses information from the code then yes you absolutely must publish your code. Otherwise you're contributing to the decline of science via the opaqueness of papers and irreproducibility problem.
If people want great code that runs easily and is easy to read, that's engineering work, built off the back of novel implementations.
If people want novel implementations that are likely rough around the edges and require a bit of finagling to run, leave that to the scientists.
Put a huge note in the readme that this is research code and only licensed for non commercial use.
Put a note on your personal homepage that you're available to hire as a research consultant for $1000 per day.
Companies who like your research will put 1+1 together. A friend of mine got hired straight out of university at a very competitive salary with this approach.
Make sure it's all safe to publish but don't spend any effort on organizing it, unless you can find some grant money for an undergrad to work on it.
If it has users they will contribute their changes to better organize it and use it.
I previously published in one of them (SoftwareX, by her majesty Elsevier the Evil), and I wish there were more venues that could bring value and recognition to the pieces of code we develop in research for other purposes.
1. encumbered by pending or active patent(s)?
2. release of proprietary holds by corporates or participants
3. any tangible market values worth pursuing, then keep it to yourself.
4. any conflict with trademarks, copyrights, or domain hold? Rename it
Those are just some of the points. Contact your local VCs if it has any traction.
This is unfortunate. In one of my articles I linked to my GitHub repo, where I had implemented the algorithm in C. One of my reviewers complained that I had used C instead of C++. It's probably advisable not to publish code before peer review.
To me it at least sends a signal of people hiding stuff. That's not good. It made me distrust some papers in the past; I tried to reach out, with no success.
So identify what's most critical or novel about your work and publish that.
Like others have said, research code isn't meant to be production quality code so I wouldn't worry about "quality" in that way.
Frankly any paper which can't offer the basics of reproducibility is adding to the current problems across many fields.
Real-world data may be restricted by copyright, so be careful if this applies. If it does, consider publishing with some MC (Monte Carlo) data demonstrating how things are supposed to work. (You did verify your code's behaviour, didn't you?)
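A sketch of that kind of check in Python (the linear model, noise level, and tolerances are all hypothetical stand-ins for your actual analysis): generate synthetic data with known ground truth, run the analysis on it, and confirm the known parameters come back out:

```python
import random

# Ship synthetic (Monte Carlo) data with known ground truth so readers
# can exercise the analysis code without the restricted real data.
# The linear model and noise level here are illustrative placeholders.
random.seed(0)
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0
xs = [i / 10 for i in range(100)]
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + random.gauss(0, 0.05) for x in xs]

# the "analysis" under test: ordinary least squares by hand
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# the verification: do we recover the known parameters?
assert abs(slope - TRUE_SLOPE) < 0.1 and abs(intercept - TRUE_INTERCEPT) < 0.1
print(f"recovered slope={slope:.2f}, intercept={intercept:.2f}")
```

Publishing the generator alongside the analysis doubles as documentation of the expected input format.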
Don't clean and organise code for publishing. It is a tool, it is not 100% perfect, but it is supposed to work. Unfortunately after years in the field sometimes the correct response to nit-picking is "I don't care".
This is the trap between "writing code that was intended to give an answer" and "code that was intended to be re-used by others". Scientists often write code that fits into the former and this code should be published (in-case of mistakes and in the interests of reproducibility). But this code should never be taken to be of the quality that it should be built on by others unless this was the express intent. People who mistake it for that haven't understood the point of the work the author is engaging in.
With regard to what license, I tend to use DWTFYWWI, or just GPL, but frankly you can pick some wonderfully closed thing if you think your code might revolutionise something which in principle stops commercial entities ripping it off directly.
The thing is that I was required to provide a way to reproduce, so obfuscated and/or uncommented code was not a problem. I provided clean code anyway.
For me, it shows the authors are confident yet also open to critique. Which is a wonderful thing.
Secondly, I usually need the code to really understand the paper.
Agree with other comments on CRAPL, but you should release it.
Worst case scenario, it will end up in a star-less github repo that nobody reads.
Science progresses by criticism, after all.
A strong result isn’t just the final number; it’s also the process by which you arrived there.
Published terrible code is far better than unpublished code.
If someone has comments about style ask them to improve it for you.
Worry about maintaining things after someone asks for maintenance, the vast majority of code is never read again.
Also, include basic instructions for running your code.
I helped my wife with a replication study that should have been straightforward, and I was unable to get the code running after about a week. I don’t necessarily believe the research was suspect, but broken code does draw more suspicion.
But to be honest, I am truly underwhelmed by the response. For several papers I created Jupyter notebooks that reproduce every single figure in the paper. It was a huge amount of work. But even though the papers with code are cited reasonably often, I've been getting only minimal feedback.
So it's really difficult to judge whether properly preparing the code is worth the effort.
On the other hand, I have run into several papers that turned out not to be reproducible without the code. Chances are that these particular papers would not have been reproducible with the code either :D (there were just too many things not adding up). But it would have saved us a lot of time if the code had been available.
Tl;dr: make the code available, but don‘t invest too much time in polishing it. Hardly anyone is going to thank you.
One exception: if you want to impress future employers, polishing code is worth it. A good portfolio on GitHub can open doors.
The paper alone is, almost always, never enough to fully reproduce the result. I've been bitten by this almost every time I've tried to implement someone else's computational model. It comes down to that only relying on your paper to explain your code leaves a LOT of room for errors. I've experienced all of these when trying to implement someone else's computational work without their code being published:
1. Despite your best efforts, you include fundamental, result-breaking typos in the equations you write up to explain the math of what you're doing. This WILL happen to you at some point in your career, and in my experience, it's a problem in >>50% of computational modeling papers.
2. There are assumptions in the logic of the code that you don't include in the writeup, since they're obvious to you, but you don't realize that someone else trying to understand your paper won't necessarily be starting with those same assumptions. This happens frequently with neural models that use complicated synapse-computation schemes.
3. Your codebase may be big enough that you think code part X works a certain kind of way from memory, but you forget that you changed the logic late in the project to work in a different way.
4. Publishing your code at the time of publication prevents "Which version did I use?" problems. It's very common for people to continue working on their science code for new projects, but they don't bother to save/tag the SPECIFIC version of their code that was used for the actual paper. The result is that even the author doesn't know what exact values were used for the results in the paper!
Any "competitive advantage" has to be weighed against "positive exposure". If your code is the primary research object (as opposed to the data), then it's technically possible that someone may grab your code, extend it to do the next interesting thing, and scoop you before you can do it yourself. However, even if this happens (which it probably won't), consider the following:
1. You can't build a successful career out of just small extensions to the same piece of code, so that codebase won't be the main kernel of your career; your understanding of it will be.
2. For every 1 person that tries to use that to scoop you, IMHO there's going to be at least 10 other people who see your code and reach out to you for help with it, or just to ask a question about it, or reach out for potential collaboration! In other words, depending on the field, if you publish the code, I think you're likely to gain new/future collaborators at a MUCH faster rate than people who compete against you. You'll be surprised at how many researchers on the other side of the planet are interested in your software!
3. Even if someone scoops you with your own code, if they give any indication it came from you, you still get to count that as a publication that built off of your software work when you're applying to jobs :)
4. At least with US federal government funding, it's gradually becoming required to do this anyways, and I believe/hope that it's going to become the standard anyways very soon.
Finally, don't fret about polishing/cleaning/organizing the code, especially style. For others trying to reproduce your results or just investigating how you did things, the main thing that matters is that your code runs "correctly", i.e. how you ran it to get the results that you did. One idea is to publish it "as is" for the CORRECTNESS of the paper, put a git tag indicating "original version", and THEN clean it up on GitHub/wherever. This prevents any new "organizing" of the code from breaking something, which would be counterproductive. This way, when people go to your code page, the first thing they see is a nicely-organized version, and you get time to test that it works the same. Honestly, if you care about this at all, then your code is probably significantly more organized than 95% of research code out there; the standards of code quality in science are VERY low, which is completely different from private-sector software engineering.
someone might clean it up for you, too
There are two things I'd want published alongside a paper:
1. a program that I can run against the data in the paper (where I can modify the data to see how that changes the results the program generates); and
2. the source code to that program, that I can read to understand what it does.
For #1, I'd encourage you to publish something like a Docker image of your built binary, to a permanent public Docker image host; to use that Docker image version of your program to do the actual experiment/data processing for your paper; and then to cite, in your paper, the specific fully-qualified Docker image ID (e.g. hub.docker.com/foo/bar@sha256:abcdef0123...6789) that was used to create the results.
I would also encourage you to, if possible, publish your data in some repository, e.g. GitHub; and to cite the data using a fixed hash (e.g. Git commit hash) as well.
With these two pieces of information, anyone can easily do the simplest possible kind of "reproduction" of your results: namely, they can fetch the same Docker image used in the paper, and then run it against the same data used in the paper, to — hopefully — produce the same results shown in the paper.
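Concretely, that simplest reproduction is just a pull, a checkout, and a run. A hedged sketch of the workflow — the image name, digest placeholder, data repository URL, commit placeholder, and mount path are all hypothetical, standing in for the identifiers cited in the paper:

```shell
# Fetch the exact image cited in the paper (placeholder digest, not a real one).
docker pull foo/bar@sha256:<digest-cited-in-the-paper>

# Fetch the data at the exact commit cited in the paper.
git clone https://github.com/foo/paper-data.git
git -C paper-data checkout <commit-hash-cited-in-the-paper>

# Run the cited image against the cited data; ideally the output matches the paper.
docker run --rm -v "$PWD/paper-data:/data" \
    foo/bar@sha256:<digest-cited-in-the-paper> /data
```

Because both the image and the data are pinned by content hash, there's no ambiguity about "which version" was used — the citation itself is the reproduction recipe.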
---
As for #2...
If you're really worried about "trade secrets", you can just solve #2 by making the code itself only "available upon request."
But don't underestimate the number of people in your field who say they're hoarding their code for reasons of "competitive advantage", but who are really doing so out of personal shame at the state the code is in, and fear that a bug might be found there that will invalidate their result.
These people are, IMHO, not embracing the spirit that led them to become scientists. You should want any bugs in your papers — including in the code — to be found! That's what the pursuit of (academic) science is about: everyone checking each other's work so that we can all believe more strongly in the results!
You don't need to clean up your code. Maybe get an "alpha reader" to go over it first, like self-published authors do, if you're worried about nitpickers. But the only thing code really "needs" in order to be valuable is to compile, run, and do something useful.
Personally, all I'd want from your repo is for there to be a Dockerfile in there that will, within its fiddly little internal build environment, manage to output the exact Docker image cited in the paper.
If I cared about modifying the code, I could take the rest from there.
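For completeness, a hypothetical skeleton of such a Dockerfile — every filename, base image, and digest placeholder here is made up. Note that a rebuild rarely reproduces the cited image digest byte-for-byte, so pinning the base image and dependency versions like this is the minimum needed to get close:

```dockerfile
# Pin the base image by digest so the build environment can't drift.
FROM python:3.11-slim@sha256:<pinned-base-digest>
WORKDIR /app

# Install dependencies pinned to exact versions in requirements.txt.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The analysis code itself; its entrypoint runs the experiment.
COPY analysis.py .
ENTRYPOINT ["python", "analysis.py"]
```

Pushing the image built from this file to a public registry, and citing the pushed digest in the paper, closes the loop between the repo and the published results.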