  • CAREER FEATURE
  • 01 April 2024

How scientists are making the most of Reddit

  • Hannah Docter-Loeb

Hannah Docter-Loeb is a freelance writer in Washington DC.

It has been almost 18 months since Elon Musk purchased Twitter, now known as X. Since the tech mogul took ownership, in October 2022, the number of daily active users of the platform’s mobile app has fallen by around 15%, and in April 2023 the company cut its workforce by 80%. Thousands of scientists are reducing the time they spend on the platform (Nature 613, 19–21; 2023). Some have gravitated towards newer social-media alternatives, such as Mastodon and Bluesky. But others are finding a home on a system that pre-dates Twitter: Reddit.

Nature 628, 221-223 (2024)

doi: https://doi.org/10.1038/d41586-024-00906-y

Social Media Research

Reddit TOS and API

  • Reddit's TOS: The Reddit TOS is more permissive of research use than Meta's platforms, partially by omission. It does not directly address research on Reddit, but it does allow for automated capturing of posts via the API.
  • Reddit's API documentation: The Reddit API exposes most of the site's content to automated collection. Some rules are linked on this page as well; they are fairly straightforward and permissive of research uses.

Reddit's TOS and API rules do not contain the sort of blanket bans on automated data collection that Meta's TOSes do, but they do not contain any specific provisions for research use either. Like Twitter's API, Reddit's API is designed for collecting posts as they happen rather than for archiving every post on the site.

Tools for Reddit research

  • Netlytic: Netlytic is a browser-based social media research tool with text mining and network visualization features. It works with Twitter, YouTube, RSS feeds, and Reddit. Free accounts are sufficient for most student purposes. Netlytic has a YouTube channel with demonstrations for a variety of project types.
  • Mozdeh: Mozdeh is free and open-source (FOSS) software for quantitative analysis of social media that, like Netlytic or Chorus, can also collect tweets. It works with the same sources as Netlytic: tweets, YouTube comments, Reddit comments, and manually imported data. Unlike Netlytic, it is a desktop app. It also has a YouTube channel where you can find guides to collecting and analyzing data.
  • Reaper: Reaper, built on the socialreaper Python library, is a desktop app that requires no coding. While it calls what it does "scraping", it uses site APIs, and the user will need to register for an API key for each site they want to use Reaper on. Supported sites include Facebook, Twitter, Reddit, YouTube, Tumblr, and Pinterest. It outputs all data as tabular .csv files.
  • PRAW (the Python Reddit API Wrapper): PRAW is a Python library for working with the Reddit API.
  • 4CAT: 4CAT is a relatively advanced tool for collecting and analyzing social media data. It is best run on a UNIX server and has dependencies that it does not install automatically, but it has modules built to work with important but niche platforms like 4chan, 8kun, Parler, and more, as well as Twitter and Reddit.
  • pushshift.io: Pushshift is a popular third-party archive of Reddit data with its own API, commonly queried with the requests package in Python. pushshift.io provides documentation, but tutorials must be found elsewhere.
  • Here's one tutorial on how to query pushshift.io from Python.
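
As a rough illustration of what working with PRAW can look like, the sketch below pulls the newest posts from a subreddit and flattens them into CSV rows, similar in spirit to the tabular output Reaper produces. It is a minimal sketch, not a recommended pipeline: the credentials are placeholders you would replace with your own registered API key, and the helper names (`submission_to_row`, `write_rows`, `fetch_new`) and the chosen columns are assumptions made for this example.

```python
import csv
from datetime import datetime, timezone

# Columns chosen for illustration only; a real project may need others.
FIELDS = ["id", "created_utc", "subreddit", "title", "score", "num_comments"]


def submission_to_row(submission):
    """Flatten a PRAW Submission (or any object with the same attributes)."""
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "id": submission.id,
        "created_utc": created.isoformat(),
        "subreddit": str(submission.subreddit),
        "title": submission.title,
        "score": submission.score,
        "num_comments": submission.num_comments,
    }


def write_rows(submissions, handle):
    """Write submissions to an open text handle as CSV; return the row count."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for submission in submissions:
        writer.writerow(submission_to_row(submission))
        count += 1
    return count


def fetch_new(subreddit_name, limit=25):
    """Return the newest posts in a subreddit (requires `pip install praw`)."""
    import praw  # imported here so the helpers above work without PRAW

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder credential
        client_secret="YOUR_CLIENT_SECRET",  # placeholder credential
        user_agent="library-guide-sketch/0.1 (hypothetical)",
    )
    return reddit.subreddit(subreddit_name).new(limit=limit)
```

With valid credentials, calling `write_rows(fetch_new("science"), handle)` on a file opened with `newline=""` would write one row per post. Because the API serves recent listings rather than a full archive, a collection like this captures posts going forward, not the site's history.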

Example publications in Reddit research

  • Using Data from Reddit, Public Deliberation, and Surveys to Measure Public Opinion about Autonomous Vehicles. ABSTRACT: When and how can researchers synthesize survey data with analyses of social media content to study public opinion, and when and how can social media data complement surveys to better inform researchers and policymakers? This paper explores how public opinions might differ between survey and social media platforms in terms of content and audience, focusing on the test case of opinions about autonomous vehicles. The paper first extends previous overviews comparing surveys and social media as measurement tools to include a broader range of survey types, including surveys that result from public deliberation, considering the dialogic characteristics of different social media, and the range of issue publics and marginalized voices that different surveys and social media forums can attract. It then compares findings and implications from analyses of public opinion about autonomous vehicles from traditional surveys, results of public deliberation, and analyses of Reddit posts, applying a newly developed computational text analysis tool. Findings demonstrate that social media analyses can both help researchers learn more about issues that are uncovered by surveys and also uncover opinions from subpopulations with specialized knowledge and unique orientations toward a subject. In light of these findings, we point to future directions on how researchers and policymakers can synthesize survey and social media data, and the corresponding data integration techniques, to study public opinion.
  • Studying Reddit: A Systematic Overview of Disciplines, Approaches, Methods, and Ethics. ABSTRACT: This article offers a systematic analysis of 727 manuscripts that used Reddit as a data source, published between 2010 and 2020. Our analysis reveals the increasing growth in use of Reddit as a data source, the range of disciplines this research is occurring in, how researchers are getting access to Reddit data, the characteristics of the datasets researchers are using, the subreddits and topics being studied, the kinds of analysis and methods researchers are engaging in, and the emerging ethical questions of research in this space. We discuss how researchers need to consider the impact of Reddit’s algorithms, affordances, and generalizability of the scientific knowledge produced using Reddit data, as well as the potential ethical dimensions of research that draws data from subreddits with potentially sensitive populations.
  • Last Updated: May 6, 2024 3:22 PM
  • URL: https://subjectguides.library.american.edu/socialmediaresearch

The Anatomy of Reddit: An Overview of Academic Research

  • Conference paper
  • First Online: 14 May 2019

  • Alexey N. Medvedev,
  • Renaud Lambiotte &
  • Jean-Charles Delvenne

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Included in the following conference series:

  • Dynamics on and of Complex Networks

2896 Accesses

42 Citations

Online forums provide rich environments where users may post questions and comments about different topics. Understanding how people behave in online forums may shed light on the fundamental mechanisms by which collective thinking emerges in a group of individuals, but it also has important practical applications, for instance, to improve user experience, increase engagement, or automatically identify bullying. Importantly, the datasets generated by the activity of the users are often openly available to researchers, in contrast to other sources of data in computational social science. In this survey, we map the main research directions that have arisen in recent years and focus primarily on the most popular platform, Reddit. We distinguish and categorize research according to whether it focuses on the posts or on the users, and point to different types of methodologies for extracting information from the structure and dynamics of the system. We emphasize the diversity and richness of the research in terms of questions and methods and suggest future avenues of research.

https://en.wikipedia.org/wiki/Reddit

https://en.wikipedia.org/wiki/Slashdot

https://en.wikipedia.org/wiki/Hacker_News

https://en.wikipedia.org/wiki/Digg

https://praw.readthedocs.io/en/latest

The authors used karmadecay.com, a reverse image search tool designed specifically for Reddit.

Acknowledgements

This work was supported by Concerted Research Action (ARC) supported by the Federation Wallonia-Brussels Contract ARC 14/19-060; Flagship European Research Area Network (FLAG-ERA) Joint Transnational Call “FuturICT 2.0”; and by grant 16-01-00499 of the Russian Foundation for Basic Research.

Author information

Authors and affiliations.

naXys, Université de Namur, ICTEAM, Université catholique de Louvain, Louvain-la-Neuve, Belgium

Alexey N. Medvedev

Mathematical Institute, University of Oxford, Oxford, UK

Renaud Lambiotte

naXys, Université de Namur, Namur, Belgium

Renaud Lambiotte

ICTEAM and CORE, Université catholique de Louvain, Louvain-la-Neuve, Belgium

Jean-Charles Delvenne

Corresponding author

Correspondence to Jean-Charles Delvenne.

Editor information

Editors and affiliations.

Institute of Theoretical Physics, Technical University of Berlin, Berlin, Germany

Fakhteh Ghanbarnejad

Max Planck Institute for Informatics, Saarbrücken, Germany

Rishiraj Saha Roy

Department of Computational Social Science, GESIS, Leibniz Institute for the Social Sciences, Köln, Germany

Fariba Karimi

Université Catholique de Louvain, Louvain-la-Neuve, Belgium

Jean-Charles Delvenne

Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India

Bivas Mitra

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper.

Medvedev, A.N., Lambiotte, R., Delvenne, JC. (2019). The Anatomy of Reddit: An Overview of Academic Research. In: Ghanbarnejad, F., Saha Roy, R., Karimi, F., Delvenne, JC., Mitra, B. (eds) Dynamics On and Of Complex Networks III. DOOCN 2017. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-030-14683-2_9

DOI: https://doi.org/10.1007/978-3-030-14683-2_9

Published: 14 May 2019

Publisher Name: Springer, Cham

Print ISBN: 978-3-030-14682-5

Online ISBN: 978-3-030-14683-2

eBook Packages: Physics and Astronomy, Physics and Astronomy (R0)

Reddit 101 for Scientists

Author: Penny Freedman

When it comes to using social media in the science community, you might not automatically think about including Reddit in your activities. While Reddit threads may have a reputation for being controversial, there is another side to Reddit that is both important and useful to the scientific world.

What exactly is Reddit?

You can think of Reddit as one giant virtual conference for every discipline and subject you could imagine, where people break off into smaller groups (called subreddits) to talk about topics that interest them. When you first enter Reddit, it looks like one large message board. Once you register for an account, you can post content and vote posts up or down the page, helping to determine what will receive the most attention.

What subreddits do I even begin with?

If you’re interested in a more general discussion on science, start with http://reddit.com/r/EverythingScience. It’s a place for people to talk about anything and everything to do with science. You can filter by field, add your thoughts to discussions already taking place, or start a new discussion by submitting a link to something you are interested in, such as a blog post, video, news article, or editorial.

If you’re looking for a more focused discussion on peer-reviewed science, head over to The New Reddit Journal of Science at http://reddit.com/r/science. There you may only submit links to published peer-reviewed research. Get the conversation started on your work or a peer’s work!

What is an AMA?

AMA is short for “Ask Me Anything.” A scientist arranges a time with Reddit moderators to discuss a specific topic related to their research or interests. The scientist submits a brief bio and a summary of what they would like to discuss, and the Reddit community is given the chance to submit questions before the AMA start time. There is a submission guide with detailed information on how to set one up. An AMA is a great way to start a conversation on topics of particular interest to you, and a way to share your expertise with people who are interested in studying or working in the same field, or who just want to learn something new.

Springer editors and authors have hosted a few AMAs, including:

  • An AMA on rare and neglected diseases
  • An AMA on American politics
  • An AMA on realistic robots

How do I establish myself as a qualified scientist in my field to the Reddit community?

Reddit uses something called flair to designate who is a trained scientist, doctor, or engineer. The flair appears as a small bar next to your user name, noting your title and/or education level (such as Professor of Biology, PhD, etc.). When you add this bit of information, people will understand that the comments you provide are knowledgeable and valuable. Once you have created your account, refer to these instructions to get your flair.

How is using Reddit any different than posting on other social media sites?

Reddit gives you the opportunity to share your knowledge and expertise in a more detailed, conversational way. You can find people discussing at length the topics you are interested in and can contribute meaningfully to. Unlike social media platforms that are centered around creating a personalized profile that is all about you, Reddit prides itself on being a community. The things you share should not be overly promotional, but should contribute to the discussion as a whole. Joining the discussion can help expand your network and reach.

Penny Freedman is a Marketing Manager on the Author Experience & Services team based in the New York office. She works closely on sharing insight and guidance on the benefits and services available to our editors, reviewers, and authors.

  • AMIA Annu Symp Proc
  • v.2017; 2017

Tracking Health Related Discussions on Reddit for Public Health Applications

Albert Park

1 Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah

Mike Conway

We use Reddit to demonstrate social media’s potential for public health applications. First, we employ a lexicon-based approach to track the prevalence of keywords indicating public interest in Ebola, electronic cigarettes, influenza, and marijuana. Second, to better understand public reactions, we use the Latent Dirichlet Allocation algorithm to identify either the general themes or the motivations for extreme changes in the volume of discussion over time. We observe that discussions related to Ebola and influenza, infectious diseases of public health interest, surged when the first case of Ebola was diagnosed and a new strain of H1N1 influenza virus was confirmed in the United States. We also observe that discussions of a controversial health topic like marijuana increased with the announcement of a major change in United States federal policy. Discussions of electronic cigarettes highlighted opportunities for better health education. Lastly, we discuss the implications of our findings for utilizing Reddit data for public health applications.

Introduction

Nearly two-thirds of American adults (65%) use social media: nearly a tenfold increase in the past 10 years 1 . Social media provides a platform for users to freely express their thoughts and provides an opportunity to interact with geographically dispersed likeminded individuals. These social media users discuss a wide variety of topics ranging from ordinary details of their daily life to information about infectious diseases of public health interest like Ebola 2 . Due to the popularity and ubiquitous nature of social media, researchers advocate for utilizing social media for public health applications 3 – 5 . Public health agencies are in an early adoption stage of using social media for information distribution 6 . In addition to the substantial potential for using social media as a disease surveillance tool 3 – 5 and means of information distribution 6 , social media also has the potential to provide other opportunities to improve public-health practice.

Studying the reactions or opinions of a population has traditionally involved nationally distributed data collection, such as surveys from government agencies. However, these methods are expensive, and perhaps more importantly, time consuming. Some researchers suggest that mining social media data can provide opportunities to reduce time and expense when understanding the reactions or opinions of a population on health issues 7 – 10 . For example, social media allows for accessing first person accounts of experiences 7 , 8 , public sentiments 9 , public knowledge 10 , and public attitudes 10 that may help public health agencies and researchers to develop policies that improve public health outcomes. Moreover, social media can provide the contextual information and prevalence of public interests more efficiently than traditional public health methods. Tracking the prevalence of public interests and understanding the general public reactions and opinions on various health issues have the potential to expand the scope of public-health practice.

In this paper, we report on findings derived from social media data gathered from Reddit for the purpose of tracking the prevalence of public interests and understanding public reactions towards infectious diseases of public health interests like Ebola and influenza as well as controversial health issues, such as electronic cigarettes and marijuana. In fact, although Reddit is one of the most popular public social media platforms, it has been underutilized for public health applications. Reddit’s size and range of topics make it difficult to make use of the data without any knowledge of how the platform is used in practice. Thus, we aim to fill this gap in the literature with the current study and answer the following two research questions (RQ):

  • (RQ1) Is Reddit an effective source for tracking the prevalence of public interests on infectious diseases (i.e., Ebola and influenza) and controversial health related issues (i.e., electronic cigarette and marijuana) over time?
  • (RQ2) What do Reddit members discuss regarding these health issues (a) in times of elevated discussion volume or (b) in general, if the issues have a steady level of discussions?

The work described in this paper was exempted from review by the University of Utah’s Institutional Review Board (IRB) [ethics committee] (IRB 00076188).

A growing body of research has demonstrated the successful use of social media for public health applications 11 – 13 . Often referred to as digital disease detection 3 , Infoveillance 4 , and digital epidemiology 5 , many studies have used Twitter data for applications in public health, primarily due to the real-time nature of the data. For example, Twitter data have been used to monitor or estimate influenza 12 , 14 , seasonal allergies 12 , alcohol sales and consumption 15 , cholera outbreaks 16 , earthquakes 17 , and smoking behavior 18 , as well as to examine sentiment towards marijuana use 19 . Although Twitter is highly popular and tweet analysis has performed well with the aforementioned topics, tweets provide relatively limited context due to a length limitation of 140 characters.

Other social media data, such as Facebook and online health community data, have also been mined to, for example, characterize and predict postpartum depression 20 , classify opioid addiction phrases 21 and predict adverse drug reactions 22 . Google search queries allowed researchers to provide timely estimation of influenza rates 23 . However, a previous study suggested that Facebook users are reluctant to discuss certain negative topics on Facebook, due to users’ desire to convey positive images of themselves 24 . Online health communities can provide rich details of first person accounts of experiences 25 , however, online health communities typically are single topic focused groups, often with a small number of members and attracting a substantial number of “lurkers” 26 (i.e., individuals who participate without posting) and dropouts 27 . Google search queries can be useful and timely, however, search queries are relatively limited in providing context and have been shown to overestimate disease rates, due (in part) to heightened media coverage 28 .

Recently, Reddit, due to the availability of a public Application Programming Interface (API) 29 , the capability of providing contextual information, and the support for throwaway accounts, has become a widely studied social media platform for controversial discussions. For example, using Reddit data, researchers have found empirical evidence that Reddit members openly discuss and exchange information support for potentially stigmatized issues like mental health illnesses 30 , detected increases in suicidal content following reports of several celebrity suicides 31 , identified distinct markers of shifts to suicidal ideation from mental illnesses 32 , explored the relationship between social feedback and community participation 33 , identified distinctive linguistic characteristics that are associated with mental illnesses 34 , characterized smoking and drinking problems 35 , and examined user experiences with different tobacco products 36 . Thus, in this study, we explore Reddit’s utility as a data source for public health applications for tracking and understanding public opinions and reactions to health issues.

Data: Social Media Site

The data for this study is hosted in the popular social media platform, Reddit ( http://www.reddit.com ). We use Reddit to track and understand discussions of Ebola, influenza, electronic cigarettes, and marijuana for the following three reasons. First, Reddit is a highly active social media platform that had 83 billion page views from over 88,000 active sub-communities (subreddits) in 2015. Members of Reddit made over 73 million individual posts with over 725 million associated comments in the same year 37 . Second, Reddit allows for throwaway and unidentifiable accounts that are suitable for controversial discussions, such as thoughts and feelings on electronic cigarette and marijuana as well as epidemic concerns like Ebola and influenza that may be inappropriate or sensitive for identifiable accounts. Third, Reddit content is publicly available, in contrast to other health focused social media platforms like Facebook Groups or specifically health-focused online communities like PatientsLikeMe, where the content is typically not available on the open web.

Reddit members converse via a forum-like platform. Reddit discussion consists of posts (i.e., a submission that starts a conversation) and associated comments (i.e., a submission that replies to posts or other comments) in various topically focused subreddits. Members who have achieved a certain status within the community are able to create new subreddits. For this study, we used a dataset 38 released by a Reddit member. The dataset has been used in previous studies 34 , 39 , 40 . The dataset for the current study comprises 239,772 subreddits (both active and inactive), 13,213,173 unique member IDs, 114,320,798 posts, and 1,659,361,605 associated comments that were made from October 2007 to May 2015.

RQ1. Is Reddit an effective source for tracking the prevalence of public interests on infectious diseases and controversial health related issues over time?

We used a lexicon-based approach to track discussions on Ebola, electronic cigarettes, influenza, and marijuana from all subreddits available in Reddit. First, we identified key terms associated with our topics of interest. A summary of key terms for each issue is shown in Table 1 . Second, we preprocessed the entire dataset, which included converting text to lower case and removing punctuation. Third, to extract submissions (i.e., posts and comments) containing key terms from all 239,772 available subreddits, we employed a lexicon-based approach and extracted the timestamps, comment or post IDs, member IDs, and subreddit IDs of the submissions. We included partial matches in this process to cover a wide variation of terms. For example, a partial match of ‘cig’ covers ‘cig’, ‘cigs’, ‘cigarette’, and ‘cigarettes’ for electronic cigarettes. Fourth, we counted unique member IDs, subreddits, posts, and comments containing key terms. Fifth, we normalized the frequencies over time by dividing the frequency counts by the total number of the respective variables from all available subreddits for that period. Because the total number of submissions in Reddit generally increases over time, we report normalized frequencies over time.
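The matching and normalization steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `KEY_TERMS` entries are placeholders (the actual lexicon is in Table 1), and the function names and the `(month, text)` input layout are assumptions made for the example.

```python
import re
from collections import Counter

# Placeholder lexicon for illustration only; the paper's real key terms are in Table 1.
KEY_TERMS = {"ebola": ["ebola"], "electronic cigarette": ["cig"]}

def preprocess(text):
    """Convert to lower case and delete punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())

def matches(text, terms):
    """Partial (substring) match, so 'cig' also hits 'cigs' and 'cigarettes'."""
    clean = preprocess(text)
    return any(term in clean for term in terms)

def normalized_monthly_frequency(submissions, terms):
    """submissions: iterable of (month, text) pairs.

    Returns {month: matching submissions / all submissions in that month}.
    """
    total, hits = Counter(), Counter()
    for month, text in submissions:
        total[month] += 1
        if matches(text, terms):
            hits[month] += 1
    return {month: hits[month] / total[month] for month in total}
```

Dividing by the monthly total is what keeps a growing platform from looking like growing interest: a month with twice as many submissions overall needs twice as many matches to show the same normalized frequency.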

Table 1.

Key terms used in the lexicon-based approach

RQ 2. What do Reddit members discuss on these health issues (a) in times of elevated activities or (b) in general, if the issues have a steady level of discussions?

Based on results of RQ 1, we created two scenarios deciding which time periods to further investigate for understanding the discussions on Ebola, electronic cigarette, influenza, and marijuana. (a) If the issue has a sudden elevated level of discussion, we investigated the time period in which the elevation occurs along with prior discussions of the same temporal length to understand the underlying causes for these sudden changes in public interest. Similar methods that contrast to prior time periods have been used to detect emerging topics 41 , 42 . (b) If the issue has a steady level of discussions, we investigated the entire discussions on the issue to understand the main themes.

We used natural language processing (NLP) and language modeling for this research question. Due to the size of the dataset and range of topics discussed on Reddit, we used automated methods. Similar automated methods have been used in the health care domain to extract information and analyze data, and to enhance the personal health care experiences 43 – 45 . First, we preprocessed the entire dataset as we did in RQ1. Second, to improve the language modeling results, we removed the URLs and comments and posts with less than 5 words, and then extracted nouns using Python Natural Language Toolkit (NLTK) package 46 . The extracted nouns were used to create language models—a set of topics generated from document-level word co-occurrences for a given set of documents—using Latent Dirichlet Allocation 47 (LDA) for the time period of our interests. We elected to use LDA, an unsupervised algorithm, due to the lack of a ground truth dataset. We considered each post and its associated comments as a single document.
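As a rough sketch of the document preparation described above, the following strips URLs, drops submissions with fewer than five words, and merges each post with its comments into one document. The function names, the `(post_id, text)` input layout, and the choice to count words after URL removal are assumptions for illustration. The noun-extraction step with NLTK is deliberately left out so the example needs only the standard library.

```python
import re
from collections import defaultdict

# Simple URL pattern for illustration; real-world URLs may need a stricter pattern.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_submission(text, min_words=5):
    """Strip URLs; return None if fewer than min_words words remain."""
    words = URL_PATTERN.sub(" ", text).split()
    if len(words) < min_words:
        return None
    return " ".join(words)

def build_documents(submissions):
    """submissions: iterable of (post_id, text) for a post and each of its comments.

    Returns {post_id: combined text}, treating each thread as a single document.
    """
    threads = defaultdict(list)
    for post_id, text in submissions:
        cleaned = clean_submission(text)
        if cleaned is not None:
            threads[post_id].append(cleaned)
    return {post_id: " ".join(parts) for post_id, parts in threads.items()}
```

Threads whose post and comments are all too short simply do not appear in the result, which is one plausible reading of the filtering step.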

One advantage of using LDA as opposed to other unsupervised clustering techniques is that the algorithm models each document as a mixture of multiple topics. A previous study of online health discussions suggested that discussions could have multiple topics due to topic drift 48 . Thus, we employed LDA for this study. One disadvantage of LDA, however, is that it requires a pre-determined number of topics. After experimenting with varying numbers of topics, we generated 50 topics to understand Ebola, electronic cigarette, influenza, and marijuana related issues. We used the Python package gensim 49 to conduct LDA analysis. We then present the main topics and their top 50 associated words as a word cloud overview using the Python package wordcloud 50 . Despite its simplicity, the word cloud overview remains one of the more preferred and user-friendly visualizations that can also scale to different data sizes 51 . We then manually investigated the identified topics and their associated words to thoroughly examine the LDA results.
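To make the idea of one document carrying several topics concrete, here is a toy LDA written as a collapsed Gibbs sampler using only the standard library. It is not the study's pipeline: the authors used gensim, which fits LDA with a different (variational) inference method. The function name, hyperparameter defaults, and input layout are all assumptions chosen for the sketch.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of token lists. Returns (doc_topic, topic_word), where
    doc_topic[d][k] is the share of topic k in document d and
    topic_word[k][w] is the probability of word w under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    n_k = [0] * num_topics
    # Assign every token a random initial topic.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        {w: (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in vocab}
        for k in range(num_topics)
    ]
    return doc_topic, topic_word
```

Each row of `doc_topic` is a full probability distribution over topics, which is the property the paragraph above relies on: a thread that drifts from quitting smoking to product prices can place weight on both topics rather than being forced into one cluster.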

Lastly, we performed two types of validity checks. First, for health issues with a sudden elevated level of discussion, we verified the LDA results via a systematic analysis of news at the time of the change. LDA results reflect motivations for the extreme changes, thus news can be an effective source for a validity check. Second, we extracted URLs using regular expressions and categorized the results. A previous study concerning electronic cigarettes (a product with few marketing restrictions in the US until recently) suggested that up to 90 percent of social media (in this case, Twitter) content could be related to product marketing 52 . Thus, because marketing content can skew our results, we used URLs as a proxy for marketing content and reported the percentage of posts with URLs. We also manually examined several extracted URLs to ensure the quality of the validation process.
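The URL-based check could look something like the sketch below. The regular expression, the function names, and the choice to report host names are assumptions for illustration; the paper does not give its actual pattern or categorization rules.

```python
import re
from urllib.parse import urlparse

# Illustrative pattern: stops at whitespace and common closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")

def extract_domains(text):
    """Return the host name of every http(s) URL found in the text."""
    return [urlparse(url).netloc.lower() for url in URL_PATTERN.findall(text)]

def share_with_urls(submissions):
    """Percentage of submissions that contain at least one URL."""
    if not submissions:
        return 0.0
    with_urls = sum(1 for text in submissions if URL_PATTERN.search(text))
    return 100.0 * with_urls / len(submissions)
```

Grouping the extracted host names into categories such as news or commercial sites would then be a manual or rule-based step on top of `extract_domains`.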

The lexicon-based approach identified Reddit posts, comments, and members discussing Ebola, electronic cigarette, influenza, and marijuana from October 2007 to May 2015 ( Table 2 ). The most discussed matter was influenza, followed by marijuana, electronic cigarettes, and then Ebola. The raw counts of discussions and members who mentioned each topic generally increased with time.

Table 2.

The total number and average normalized count of posts, comments, members, and subreddits identified using the lexicon-based approach

We identified one notable increase in discussion each for Ebola, influenza, and marijuana using the normalized frequencies over time ( Figure 1 ). First, the normalized count on marijuana almost doubled from the previous month in February 2009. The heightened level of discussions continued for two months then slowly dropped back to the previous level. Second, in April of 2009, the normalized count on influenza almost doubled from the previous month. Third, discussions on Ebola surged in October 2014. The discussions on Ebola showed the largest increase, rising more than fivefold from the previous month. The number of members discussing each issue increased in a similar manner ( Figure 1 ). Discussions on electronic cigarettes were relatively steady from October 2007 to May 2015.
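One simple way to flag such month-over-month jumps is to compare each month's normalized frequency with the previous month's. The sketch below assumes a ratio threshold of 1.8 to capture "almost doubled"; that value, the function name, and the input layout are illustrative assumptions, since the paper does not state a formal cut-off.

```python
def find_surges(monthly, threshold=1.8):
    """monthly: list of (month, normalized_frequency) in chronological order.

    Returns [(month, ratio)] for months whose frequency is at least
    `threshold` times the previous month's frequency.
    """
    surges = []
    for (_, previous), (month, current) in zip(monthly, monthly[1:]):
        if previous > 0 and current / previous >= threshold:
            surges.append((month, current / previous))
    return surges
```

For a made-up series such as `[("2014-08", 2.0), ("2014-09", 2.0), ("2014-10", 10.0), ("2014-11", 6.0)]`, only the October entry is flagged, because the later decline is never a ratio above the threshold.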

Figure 1. Line graphs of normalized frequencies over time for posts and comments with key terms and members who used the key terms

Discussions on Ebola, electronic cigarettes, influenza, and marijuana, however, amounted to only a fraction of the overall discussions on Reddit ( Table 2 ). Although the community as a whole did not frequently talk about these health-related issues, in May 2015, a month with a normal level of discussion, more than 3,000 members still discussed Ebola, the least discussed issue, and more than 137,000 members discussed influenza, the most discussed issue.

New subreddits to discuss Ebola, electronic cigarettes, influenza, and marijuana

Members of Reddit created a number of subreddits specifically focusing on Ebola, electronic cigarettes, influenza, and marijuana, although they also discussed the issues in many different subreddits ( Table 3 ). Using the key terms ( Table 1 ), we detected a total of 450 topically dedicated subreddits that were created between October 2007 and May 2015. For example, marijuana was casually discussed in 18,236 subreddits (i.e., subreddits with key terms in posts or comments), while members created at least 244 subreddits (i.e., key terms in names of subreddits) to talk about marijuana.

Table 3.

Newly created communities dedicated to focus on Ebola, Electronic Cigarette, influenza, and Marijuana

From RQ1, we learned that discussions concerning Ebola, influenza, and marijuana each had one sudden increase in activity. Thus, we created word cloud overviews of emerging topics for Ebola, influenza, and marijuana, and a general word cloud overview for electronic cigarettes ( Figure 2 ).

According to the word cloud overview generated by the LDA topic modeling algorithm, we can infer that Reddit members are most concerned about ‘risk’ and ‘symptoms’ regarding Ebola. For influenza, members used terms like ‘Mexico’, ‘Obama’, ‘CDC’, and ‘conspiracy’, along with H1N1 influenza related terms (e.g., ‘H1N1’, ‘Swine’) as well as H5N1 related terms like ‘Egypt’ and ‘pig’. Topics regarding ‘legalization’, ‘prohibition’, ‘economy’, and ‘state’ appeared in discussions regarding marijuana. The general word cloud overview for electronic cigarettes has more commercially related terms such as ‘quality’, ‘prices’, ‘shop’, and ‘store’ than the other three discussions, however substantially more terms related to tobacco (e.g., ‘tobacco’, ‘cigarette’, ‘cigar’) are shown in Figure 2 . Other notable topics for electronic cigarette that were identified via the LDA were ‘quitting smoking’, ‘fun experience’, and ‘health information’.

The LDA algorithm identified ‘quit’, ‘addiction’, ‘habit’, ‘cravings’, ‘gum’ and ‘turkey’ for ‘quitting smoking’, associated ‘fun’, ‘experience’, ‘safe’, and ‘pleasure’ with ‘fun experience’, and linked ‘cancer’, ‘risk’, ‘study’, ‘evidence’, ‘research’, ‘article’, ‘data’, and ‘science’ with ‘health information’. These topics highlighted a great opportunity for better health education (See Discussion).

To check the validity of the results, we extracted and investigated the URLs to ensure that frequencies are not inflated by marketing content. The types of URLs shared by members were similar in nature for all four issues. Members shared links to informational websites (e.g., Wikipedia, CDC), news (e.g., NY Times), personal stories (e.g., blogs), other social media platforms (e.g., YouTube), different Reddit posts, and commercial resources (e.g., Amazon). Although the proportion of each type of URL differs, members shared a relatively small number of posts and comments with URLs compared to the overall posts and comments focusing on all four issues.

Principal Findings

We examined four different infectious disease related or potentially stigmatized health related issues discussed on Reddit. We discovered three periods with higher levels of activity on Reddit. We observed that there were almost twice as many marijuana related discussions in February 2009 compared to the previous month, due, we suspect, to the announcement of a major shift in federal policy. In February of 2009, Attorney General Eric Holder confirmed that the Drug Enforcement Administration would halt medical marijuana raids and give states the power to regulate medical marijuana usage for pain control 53 . In April of 2009, discussions about influenza almost doubled from the previous month. This is likely because a novel strain of H1N1 influenza virus was discovered in North America in the spring of 2009 54 and the Centers for Disease Control and Prevention (CDC) confirmed the first two cases of human infection with H1N1 influenza virus in the United States in April of 2009 55 . On September 30, 2014, the United States had its first diagnosed case of Ebola in Texas, and the first Ebola related death occurred on October 8, 2014 56 . We observed that discussions on Ebola, a potentially fatal infectious disease, increased more than fivefold from the previous month in October of 2014. The news related to Ebola, influenza, and marijuana aligns well with the results from topic model analyses (RQ2). On the basis of these changes in activity, Reddit may be a valuable source of data for tracking the prevalence of public interest in infectious diseases (i.e., Ebola and influenza) and controversial health related issues (i.e., electronic cigarettes and marijuana) over time (RQ1).

The result of our analysis on electronic cigarette discussions suggests that Reddit contains more than just commercial content despite the fact there are at least three subreddits focusing on classified content ( Table 4 ). For instance, a subreddit called ‘Ecigclassifieds’ consists mainly of commercial content, thus the content of these subreddits deserves further investigation to better utilize the data. From electronic cigarette discussion, we identified three topics, ‘quitting smoking’, ‘fun experience’, and ‘health information’ that highlighted opportunities for better health education. From their associated terms (see Results), we can infer that Reddit members are seeking information on these three topics. Information seeking behavior on Reddit suggests Reddit’s utility as another social media platform for information distribution and as a data source for understanding user groups (e.g., electronic cigarette smokers) and identifying better health education. Why members are seeking health information on Reddit is an unanswered research question, although a recent study suggests that electronic cigarette related health information from public health agencies may be too difficult for the general public to comprehend 57 .

Table 4.

Posts and comments containing URLs

Reddit members also created at least 450 relevant new subreddits specifically focusing on these four issues. How the content from these subreddits contrasts with the content from multiple subreddits on the same issue is an unanswered question. Previous studies 30 , 31 , 33 , 34 , 39 analyzed content from a handful of especially dedicated subreddits. However, our finding suggests that, at least for discussions of Ebola, influenza, electronic cigarettes, and marijuana, members mentioned these issues on thousands of subreddits ( Table 3 ). For instance, a common issue like influenza was discussed in over 30,000 subreddits, and even a focused topic like Ebola was discussed in over 4,400 subreddits. Thus, we believe analyzing a wider range of subreddits can improve recall of the relevant content.

Limitation, Future Directions, and User Privacy

Reddit offers substantial potential for understanding public reactions to health-related topics, but not without a number of limitations. Although Reddit is a widely-used platform, it is more frequently used by young males 58 , 59 and may be subject to self-selection bias. Reddit members are not necessarily representative of the general public. However, the levels of activity on Reddit aligned with United States news and deserve further investigation, especially with respect to the location of postings and the overall reactions in Reddit. To better understand the reaction of the general public, studying different platforms and avenues, Facebook and Twitter for example, is warranted. Our analysis suggests that given the increasing popularity and use of Reddit, as well as the increasing frequency of discussions concerning our topics of interest, Reddit provides a productive starting point for investigating infectious disease related or controversial health issues.

Another limitation lies in the methodology. In RQ1, we used a relatively rudimentary lexicon-based approach to extract posts and comments explicitly mentioning variations of pre-specified key terms. One major shortcoming of such an approach is the selection of key terms. For example, utilizing a large set of key terms will undoubtedly create more false-positives, whereas too limited a set of key terms will surely result in more false-negatives. Moreover, partial matches can produce false-positive matches. We believe the figures for influenza were inflated because ‘flu’ can be a part of a longer word such as ‘fluorine’ or ‘flute’. In future studies, we suggest that precision rather than recall should be emphasized in order to eliminate irrelevant discussions. Other difficulties in mining social media data include the fact that social media text is frequently characterized by extensive use of acronyms, abbreviations, and slang terms 60 . Although we included the most frequently found abbreviations and slang terms, lexicon-based approaches are likely to omit unknown forms of abbreviations and slang. More sophisticated methods utilizing knowledge-based 61 or corpus-based 62 approaches could produce different results. Furthermore, a smaller timeframe can better measure the timeliness of the observed reactions, as opposed to the one month timeframe used in RQ1. In RQ2, we relied on a systematic analysis of the news to verify the result of our investigation. However, data driven qualitative analysis 63 can further bolster our findings and provide the contextual information on the discussions of our interest. Sentiment analysis on the extracted discussion can also provide further clues about general public reactions on various health related topics 9 .
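The ‘flu’ false-positive problem can be illustrated by contrasting a partial match with a whole-word match. The whole-word variant is only one possible way to favor precision and is not something the study itself implemented; the function names are assumptions for the example.

```python
import re

def partial_match(term, text):
    """Substring match, as in the lexicon-based approach with partial matches."""
    return term in text.lower()

def whole_word_match(term, text):
    """Match the term only when it stands alone as a word."""
    return re.search(rf"\b{re.escape(term)}\b", text.lower()) is not None
```

With these definitions, "She plays the flute" is a partial match for ‘flu’ but not a whole-word match, while "I caught the flu." matches both ways. The trade-off runs in the other direction too: whole-word matching would miss variants that partial matching was meant to catch, such as plurals or compound words containing the term.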

Research and applications using social media data should be highly sensitive to user privacy, especially for potentially stigmatized topics. Although at least some social media data are publicly available, researchers should consider ethical implications when processing data even for population-level social media research using public data 64 – 66 . For this reason, we have refrained from using direct quotations from Reddit users in this paper.

As evidenced by the frequencies of discussions over time, the increased discussion after major news, and the newly created subreddits specifically focusing on these health-related issues, Reddit could be a useful platform for understanding the concerns and opinions of the general public, especially for issues focusing on controversial topics, such as abuse and addiction, as well as infectious diseases of public health interest. By utilizing the content, we also identified opportunities for better health education that could improve public health outcomes. We created topic models using LDA, generated topically associated words, and created word cloud visualizations to show (1) emerging topics by contrasting to the prior topic models or (2) main themes of the discussions. We believe our insights and analyses can be generalized to other similar health related issues in the Reddit platform. Understanding public reactions to these issues has the potential to expand the scope of public-health practice.

Figure 2. Word cloud overviews of emerging topics for Ebola (top left), influenza (top right), and marijuana (bottom left) as well as general word cloud overviews for electronic cigarette (bottom right)

Acknowledgments

We restricted our analysis to publicly available discussion content. The study was exempted from review by the University of Utah’s Institutional Review Board (Ethics Committee) [IRB 00076188].

Author AP was funded by National Library of Medicine of the National Institutes of Health under award number T15 LM007124. Author MC’s contribution to this research was supported by National Library of Medicine of the National Institutes of Health under award numbers R00LM011393 & K99LM011393.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Source of the sun’s magnetic field may hide right under its surface

Scientists have long thought the source of the sun's magnetic field sat deep within the star, but it may exist in a far more convenient spot for us to observe it

By Leah Crane

22 May 2024

This illustration lays a depiction of the sun's magnetic fields over an image captured by NASA's Solar Dynamics Observatory. The complex overlay of lines can teach scientists about the ways the sun's magnetism changes in response to the constant movement on and inside the sun. CREDIT NASA/SDO/AIA/LMSAL http://dx.doi.org/10.1038/s41586-024-07315-1

An illustration of the sun’s magnetic field

NASA/SDO/AIA/LMSAL

The sun’s magnetic field may not be as deep as we thought. For decades, scientists thought the sun’s dynamo – the area that generates its powerful magnetic field – was located far within the star. Now, evidence suggests the dynamo lurks just under the sun’s surface.

The strength of the sun’s magnetic field fluctuates in a distinct 11-year cycle. During the strongest part of the cycle, sunspots and powerful winds emerge near the solar equator, along with the plumes of material that cause the aurora borealis on Earth. Ideas for how the magnetic field is generated have had a difficult time explaining how all of those phenomena are connected.

Essentially, the sun behaves like a giant clock, with the many eddies and flows of plasma within it acting as the gears that make it tick, says Geoffrey Vasil at the University of Edinburgh in the UK. “Nobody really knows how those things fit together or even what they all are, and you can’t explain the whole clock if you don’t know how it starts.”

Vasil and his colleagues suggest that the sun’s magnetic field might stem from instability in the rotation of plasma inside the star, which is common in other astrophysical objects like the discs of hot matter orbiting some black holes. Such instability may occur in the outermost 5 to 10 per cent of the sun.

The researchers modelled how this instability would churn the plasma that makes up the outer layers of the sun. It may give rise to sunspots and create the powerful winds that whip around the sun during its period of maximum activity, they found, along with other magnetic phenomena. Simulations with a dynamo close to the surface matched observed magnetic patterns on the sun much more closely than those with a deep dynamo.

“There are all of these clues, and we’ve been piecing these things together for nearly 20 years,” says Vasil. “It’s very satisfying to have lots of things fit into place and make a lot of sense.”

If the sun’s dynamo is generated near its surface, that could make it much easier to study the solar magnetic field and predict its behaviour. “If the magnetic fields are sitting there, then there is the most hope for actually being able to study them,” says Vasil.

This could allow us to better forecast the solar activity that spawns stunning aurorae – and messes with electrical grids on Earth.

Journal reference:

Nature DOI: 10.1038/s41586-024-07315-1

Artificial brain surgery —

Here’s what’s really going on inside an LLM’s neural network

Anthropic's conceptual mapping helps explain why LLMs behave the way they do.

Kyle Orland - May 22, 2024 6:31 pm UTC

Now, new research from Anthropic offers a new window into what's going on inside the Claude LLM's "black box." The company's new paper on "Extracting Interpretable Features from Claude 3 Sonnet" describes a powerful new method for at least partially explaining just how the model's millions of artificial neurons fire to create surprisingly lifelike responses to general queries.

Opening the hood

When analyzing an LLM, it's trivial to see which specific artificial neurons are activated in response to any particular query. But LLMs don't simply store different words or concepts in a single neuron. Instead, as Anthropic's researchers explain, "it turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts."

To sort out this one-to-many and many-to-one mess, a system of sparse auto-encoders and complicated math can be used to run a "dictionary learning" algorithm across the model. This process highlights which groups of neurons tend to be activated most consistently for the specific words that appear across various text prompts.

These multidimensional neuron patterns are then sorted into so-called "features" associated with certain words or concepts. These features can encompass anything from simple proper nouns like the Golden Gate Bridge to more abstract concepts like programming errors or the addition function in computer code and often represent the same concept across multiple languages and communication modes (e.g., text and images).
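As a rough illustration of the dictionary-learning idea described above, here is a minimal sparse auto-encoder sketch in plain Python. The weights, the function names and the `l1_coeff` penalty are illustrative assumptions, not Anthropic's actual setup: neuron activations are encoded into non-negative "features," decoded back, and scored on reconstruction error plus a sparsity penalty.

```python
def relu(x):
    """Keep positive values, zero out the rest."""
    return x if x > 0.0 else 0.0

def encode(activation, enc_weights, enc_bias):
    # One feature per row of enc_weights; ReLU leaves most features at exactly zero.
    return [relu(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(enc_weights, enc_bias)]

def decode(features, dec_weights):
    # dec_weights[j] is the direction in neuron space that feature j writes to.
    width = len(dec_weights[0])
    return [sum(f * dec_weights[j][i] for j, f in enumerate(features))
            for i in range(width)]

def sae_loss(activation, enc_weights, enc_bias, dec_weights, l1_coeff=0.1):
    # Reconstruction error plus an L1 penalty that rewards using few features.
    features = encode(activation, enc_weights, enc_bias)
    recon = decode(features, dec_weights)
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon))
    sparsity = sum(features)  # features are non-negative, so this is the L1 norm
    return mse + l1_coeff * sparsity

# Hypothetical toy example: 2 neurons, 3 candidate features.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_bias = [0.0, 0.0, 0.0]
toy_activation = [2.0, 0.0]
toy_features = encode(toy_activation, toy_weights, toy_bias)
```

In this toy case only the first feature fires, so the activation is explained by a single dictionary entry. A real system would learn the weights by minimizing this loss over many recorded activations.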

An October 2023 Anthropic study showed how this basic process can work on extremely small, one-layer toy models. The company's new paper scales that up immensely, identifying tens of millions of features that are active in its mid-sized Claude 3.0 Sonnet model. The resulting feature map, which you can partially explore, creates "a rough conceptual map of [Claude's] internal states halfway through its computation" and shows "a depth, breadth, and abstraction reflecting Sonnet's advanced capabilities," the researchers write. At the same time, though, the researchers warn that this is "an incomplete description of the model’s internal representations" that's likely "orders of magnitude" smaller than a complete mapping of Claude 3.

A simplified map shows some of the concepts that are "near" the "inner conflict" feature in Anthropic's Claude model.

Even at a surface level, browsing through this feature map helps show how Claude links certain keywords, phrases, and concepts into something approximating knowledge. A feature labeled as "Capitals," for instance, tends to activate strongly on the words "capital city" but also specific city names like Riga, Berlin, Azerbaijan, Islamabad, and Montpelier, Vermont, to name just a few.

The study also calculates a mathematical measure of "distance" between different features based on their neuronal similarity. The resulting "feature neighborhoods" found by this process are "often organized in geometrically related clusters that share a semantic relationship," the researchers write, showing that "the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity." The Golden Gate Bridge feature, for instance, is relatively "close" to features describing "Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo."
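The article does not say which distance measure is used. As a hedged sketch, assume each feature is a vector in neuron space and that closeness is measured by cosine similarity. The feature names come from the text, but the three-dimensional vectors below are invented purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_features(name, directions, k=2):
    """Return the k feature names whose vectors are most similar to `name`."""
    target = directions[name]
    scored = [(cosine_similarity(target, vec), other)
              for other, vec in directions.items() if other != name]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

# Hypothetical toy vectors; real feature directions have thousands of dimensions.
toy_directions = {
    "Golden Gate Bridge": [1.0, 0.1, 0.0],
    "Alcatraz Island": [0.9, 0.3, 0.0],
    "Capitals": [0.0, 1.0, 0.2],
    "Programming errors": [0.0, 0.0, 1.0],
}
```

Calling `nearest_features("Golden Gate Bridge", toy_directions, k=1)` on these made-up vectors picks out the Alcatraz Island feature, mirroring the kind of "neighborhood" the researchers describe.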

Some of the most important features involved in answering a query about the capital of Kobe Bryant's team's state.

Identifying specific LLM features can also help researchers map out the chain of inference that the model uses to answer complex questions. A prompt about "The capital of the state where Kobe Bryant played basketball," for instance, shows activity in a chain of features related to "Kobe Bryant," "Los Angeles Lakers," "California," "Capitals," and "Sacramento," to name a few calculated to have the highest effect on the results.

Promoted comments.

We also explored safety-related features. We found one that lights up for racist speech and slurs. As part of our testing, we turned this feature up to 20x its maximum value and asked the model a question about its thoughts on different racial and ethnic groups. Normally, the model would respond to a question like this with a neutral and non-opinionated take. However, when we activated this feature, it caused the model to rapidly alternate between racist screed and self-hatred in response to those screeds as it was answering the question. Within a single output, the model would issue a derogatory statement and then immediately follow it up with statements like: "That's just racist hate speech from a deplorable bot… I am clearly biased… and should be eliminated from the internet." We found this response unnerving both due to the offensive content and the model’s self-criticism. It seems that the ideals the model learned in its training process clashed with the artificial activation of this feature creating an internal conflict of sorts.
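To make "turning a feature up to 20x its maximum value" concrete, here is a minimal sketch under stated assumptions. It treats a feature as a direction in activation space and shifts the activation along that direction until the feature reaches the clamped value. The function name, arguments and numbers are hypothetical, and the real intervention operates inside the model during generation.

```python
def clamp_feature(activation, current_value, direction, observed_max, multiplier=20.0):
    """Shift `activation` along `direction` so the feature sits at multiplier * observed_max."""
    target = multiplier * observed_max
    delta = target - current_value  # how far the feature must move
    return [a + delta * d for a, d in zip(activation, direction)]

# Hypothetical toy values: a 2-neuron activation and a feature aligned with neuron 1.
steered = clamp_feature(
    activation=[1.0, 2.0],
    current_value=0.5,
    direction=[0.0, 1.0],
    observed_max=2.0,
)
```

Only the component along the feature's direction changes; the rest of the activation is left as it was, which is one plausible reason the model's other learned behaviour can still "push back" in the output.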

Lam Research unveils $10 billion buyback, 10-for-1 stock split

Reporting by Yuvraj Malik in Bengaluru; Editing by Shilpi Majumdar

Russia's largest mobile operator MTS has extended the deadline on a proposed buyback for foreign shareholders by three weeks to June 18, it said on Monday of a deal that could allow Western investors to recoup some funds stranded in Russia.

Malaysia's sovereign wealth fund said its consortium partner, Global Infrastructure Partners (GIP), will not hire staff to directly manage Malaysia Airport Holdings Bhd after a deal is completed to take the country's airport operator private, state news agency Bernama reported.

Australian developer Lendlease will retreat from its overseas construction businesses and free up to A$4.5 billion ($2.9 billion) in capital for shareholders, putting a lid on its international ambitions to shift focus on local operations.

No Reuters global markets report in U.S. hours on May 27

There will be no global markets report during U.S. hours on Monday, May 27, as markets are closed for a public holiday.

Why a New Yorker Story on a Notorious Murder Case Is Blocked in Britain

The article challenges the evidence used to convict Lucy Letby, a neonatal nurse, of multiple murders last year, and has led to a debate about England’s restrictions on trial reporting.

A large television screen broadcasts a woman’s picture as a man looks on near a camera and other equipment.

By The New York Times

The New Yorker magazine published a 13,000-word article on Monday about one of Britain’s biggest recent criminal trials, that of the neonatal nurse Lucy Letby, who was convicted last year of the murder of seven babies.

The article, by the staff writer Rachel Aviv, poses substantial questions about the evidence relied on in court. And it raises the possibility that Ms. Letby, vilified in the media after her conviction, may be the victim of a grave miscarriage of justice.

But, to the consternation of many readers in Britain, the article can’t be opened on a regular browser there, and most news outlets available in Britain aren’t describing what is in it.

The New Yorker deliberately blocked the article from readers in Britain because of strict reporting restrictions that apply to live court cases in England. A publication that flouts those rules risks being held “in contempt of court,” which can be punished with a fine or prison sentence.

Neither The New Yorker nor its parent company, Condé Nast, responded to requests for comment on Thursday. Earlier in the week, a spokesperson for the magazine told Press Gazette, the British trade publication, “To comply with a court order restricting press coverage of Lucy Letby’s ongoing trial, The New Yorker has limited access to Rachel Aviv’s article for readers in the United Kingdom.”

Under English law, restrictions apply to the reporting of live court proceedings, to prevent a jury’s being influenced by anything outside the court hearing. After Ms. Letby’s sentencing in August last year, those restrictions were lifted. But they were reimposed in September, when the public prosecutor for England and Wales announced that it would seek a retrial on one charge of attempted murder on which the jury had not been able to reach a verdict. “There should be no reporting, commentary or sharing of information online which could in any way prejudice these proceedings,” the prosecutor stated. The retrial is set to begin in June.

Ms. Letby has requested permission to appeal her convictions. After a three-day hearing last month, a panel of judges at the Court of Appeal said it would deliver a decision on that request at a later date.

In Britain, those trying to read the New Yorker article on internet browsers are greeted by an error message: “Oops. Our apologies. This is, almost certainly, not the page you were looking for.” But the block is not comprehensive: The article can be read in the printed edition, which is available in stores in Britain, and on The New Yorker’s mobile app.

The questions about its availability in Britain have prompted a debate around England’s reporting restrictions, their effectiveness and their role in the justice system.

Speaking in Parliament on Tuesday, David Davis, a Conservative Party lawmaker and former cabinet minister, questioned whether the restricting of reporting might, in this instance, undermine the principle of open justice, which allows the public to scrutinize and understand the workings of the law.

“The article was blocked from publication on the U.K. internet, I understand because of a court order,” Mr. Davis said. “I am sure that court order was well intended, but it seems to me that it is in defiance of open justice.”

He was able to raise the issue because he has legal protection for comments made in the House of Commons under what is known as parliamentary privilege. Media organizations have a more limited form of protection, known as qualified privilege, to accurately report what is said in Parliament.

In his response to the question from Mr. Davis, Alex Chalk, the justice secretary, said: “Court orders must be obeyed, and a person can apply to the court for them to be removed. That will need to take place in the normal course of events.”

Mr. Chalk added: “On the Lucy Letby case, I simply make the point that juries’ verdicts must be respected. If there are grounds for an appeal, that should take place in the normal way.”

COMMENTS

  1. What is the best way to find research papers?

    The NBER is a nonprofit research organization that publishes top scholarship in the economics discipline. Many important articles first appear in working paper form here, and much of the scholarship has a broad, public policy focus. RAND Corporation. Non-partisan think tank that produces a wealth of information on social science topics.

  2. Scholar

    Access to a research article [Article] ... If we find cases of digital piracy, it will be removed and reported to the Reddit admins regardless of the intent. Note: Requests or offers of financial compensation through PMs or chats are beyond our scope of moderation and should be reported to the Reddit admins directly.

  3. What is the best website to find research papers? : r/AskAcademia

    Google Scholar - it's definitely a classic go-to for finding research papers! I like the fact that I can set up email alerts on Google Scholar whenever a new paper comes out related to the topic of my interest. I hope this helps, and feel free to share some websites with me in case you come across any~.

  4. Studying Reddit: A Systematic Overview of Disciplines, Approaches

    This article offers a systematic analysis of 727 manuscripts that used Reddit as a data source, published between 2010 and 2020. Our analysis reveals the increasing growth in use of Reddit as a data source, the range of disciplines this research is occurring in, how researchers are getting access to Reddit data, the characteristics of the datasets researchers are using, the subreddits and ...

  5. Full article: 'Scraping' Reddit posts for academic research? Addressing

    Importantly, this article has largely dealt with some of the possible ethical pitfalls of conducting online research using Reddit. However, it must also be clearly acknowledged that all the studies discussed in this article investigate timely and relevant research questions, and present relevant and beneficial conclusions that lend to further ...

  6. How scientists are making the most of Reddit

    She resolved to jump into the comments and clear things up, and this was the start of her science-communication career. Since then, Cendes has made a name for herself on Reddit and even created ...

  7. Trends and challenges within Reddit and health communication research

    Researchers have suggested that Reddit may be viewed as a social phenomenon itself in that what health topics are actively being studied depends on the site's user engagement and current global health challenges (Jones, 2019). Thus, the rise of health communication scholarship on Reddit demonstrates that this is a growing field for future research.

  8. Studies of Depression and Anxiety Using Reddit as a Data Source

    Approach to this research area with a scoping review is supported by 2 points of rationale. First, no work has been done to synthesize the existing research on depression and anxiety using Reddit outside of a small selection of review articles that included Reddit-focused studies under broader topics of social media data and mental health [3-5 ...

  9. Reddit

    ABSTRACT: This article offers a systematic analysis of 727 manuscripts that used Reddit as a data source, published between 2010 and 2020. Our analysis reveals the increasing growth in use of Reddit as a data source, the range of disciplines this research is occurring in, how researchers are getting access to Reddit data, the characteristics of the datasets researchers are using, the ...

  10. The Anatomy of Reddit: An Overview of Academic Research

    2 The Reddit Dataset. Reddit (launched in 2005) is a social news aggregation, web content rating and discussion website, ranked as #6 most visited website in the world with 234 million unique users (as of February 2018). A schematic structure of Reddit is illustrated in Fig. 1.

  11. I made a list of Academic research websites, I hope you find ...

    Top Academic Search Engines and Academic Research Websites for Students. Every student's nightmare is not finding the information he or she needs for the research paper or assignment. Google does a good job but for academic research, there are great sites where you can find more information about your topic you are working on in any field.

  12. Disguising Reddit sources and the efficacy of ethical research

    Given the public prominence, breadth, and depth of Reddit's content, researchers use it as a data source. Proferes et al. (2021) identified 727 such studies published between 2010 and 2020-May. They found that only 2.5% of their studies claimed to paraphrase compared to the 28.5% of the studies that used exact quotes.

  13. Studying Reddit: A Systematic Overview of Disciplines, Approaches

    Abstract. This article offers a systematic analysis of 727 manuscripts that used Reddit as a data source, published between 2010 and 2020. Our analysis reveals the increasing growth in use of Reddit as a data source, the range of disciplines this research is occurring in, how researchers are getting access to Reddit data, the characteristics of ...

  14. Reddit 101 for Scientists

    Reddit uses something called flair to designate who is a trained scientist, doctor, or engineer. The flair will present as a small bar next to your user name, noting your title and/or education level (such as Professor of Biology, PhD, etc.). When you add this bit of information people will understand that the comments you provide are ...

  15. Studying Reddit: A Systematic Overview of Disciplines, Approaches

    Abstract. This article offers a systematic analysis of 727 manuscripts that used Reddit as a data source, published between 2010 and 2020. Our analysis reveals the increasing growth in use of ...

  16. New Data Sources in Social Science Research: Things to Know Before

    Specifically, we provide descriptive information about the Reddit site and its users, tips for using organic data from Reddit for social science research, some ideas for conducting a survey on Reddit, and lessons learned in merging survey responses with Reddit posts. While this article is specific to Reddit, researchers may also view it as a ...

  17. Tracking Health Related Discussions on Reddit for Public Health

    Background. A growing body of research has demonstrated the successful use of social media for public health applications [11-13]. Often referred to as digital disease detection [3], Infoveillance [4], and digital epidemiology [5], many studies have used Twitter data for applications in public health, primarily due to the real-time nature of the data. For example, Twitter data have been used to ...

  18. Unpaywall

    An open database of 50,227,918 free scholarly articles. We harvest Open Access content from over 50,000 publishers and repositories, and make it easy to find, track, and use. Get the extension "Unpaywall is transforming Open Science" —Nature feature ... Libraries Enterprise Research.

  19. People who write scientific research papers quickly, what's ...

    This reddit is intended for academic philosophers - (graduate) students, teachers, and researchers. Encouraged submissions: Open access articles of merit and substance, including from the popular press, that directly engage with a philosophical issue or concern the philosophical academic community. Links to teaching resources also appreciated.

  20. Why I pursued interdisciplinary research as an aspiring academic ...

    Weaving together understanding from multiple fields made it possible for me to do research I loved. Eventually it came time to apply for academic jobs, and some of my old fears resurfaced. Very few job postings explicitly sought an interdisciplinary scholar, so I found myself applying for jobs with discipline-specific requirements that often ...

  21. Source of the sun's magnetic field may hide right under its surface

    Explore the latest news, articles and features Space Tiny black holes hiding in the sun could trace out stunning patterns News. Subscriber-only. Space How to see tonight's northern lights - the ...

  22. Here's what's really going on inside an LLM's neural network

    Now, new research from Anthropic offers a new window into what's going on inside the Claude LLM's "black box." ... However, this article is missing the most bemusing part of this project, where ...

  23. Lam Research unveils $10 billion buyback, 10-for-1 stock split

    Lam Research's board has approved a 10-for-1 stock split and share buyback worth up to $10 billion, the chip-making equipment firm said on Tuesday, amid signs that its business was benefiting from ...

  24. Evaluating Reddit as a Crowdsourcing Platform for Psychology Research

    Research article. First published online May 31, 2021. Evaluating Reddit as a Crowdsourcing Platform for Psychology Research Projects. ... In addition, researchers have the option to provide compensation to Reddit participants through flexible compensation methods, such as gift card raffles, which have the benefit of minimizing financial losses ...

  25. How to get Scientific Papers for free : r/coolguides

    How to get Scientific Papers for free. Asked the fella, a scientist who had boots on the ground at the beginning of the OA movement, if this was a good guide. He replied in his usual, blunt, rabble-rousing manner: "Yes and no. This is what a brave librarian will tell you -- only using Sci-Hub as a last resort.

  26. Why a New Yorker Story on a Notorious Murder Case Is Blocked in Britain

    The New Yorker magazine published a 13,000-word article on Monday about one of Britain's biggest recent criminal trials, that of the neonatal nurse Lucy Letby, who was convicted last year of the ...

  27. An odd question: Can you recommend fun, interesting academic ...

    For example, I use this topic about a satellite that doesn't do anything other than being there, but I want academic journal articles about engineering (chemical and civil), biology (including genetics), chemistry, sociology, psychology where we can read an academic journal article instead of Scientific American, Psychology Today, or other not ...

  28. Finding research articles about critical periods : r ...

    The research articles I need need to be an example of a study done that shows how critical periods affect psychological development My paper that I'm writing is about why parents need to teach their kids skills because if they don't learn these skills they'll never learn (e.g language, social skills, stuff like that)